[00:01.420 --> 00:03.440]  Done. Okay, is that better?
[00:04.500 --> 00:05.200]  Yes.
[00:05.200 --> 00:05.720]  Nice.
[00:05.880 --> 00:06.780]  That's good.
[00:06.780 --> 00:10.200]  Nailed it. Thank you for hanging in there.
[00:12.620 --> 00:15.460]  Okay, so I want to ask, what isn't machine learning?
[00:15.820 --> 00:23.760]  So, as math underpins everything we do,
[00:23.760 --> 00:28.320]  I think you get to a point where everything is machine learning.
[00:28.320 --> 00:35.660]  So, if you think about when you're a child and you're trying to learn to walk,
[00:35.660 --> 00:42.280]  it doesn't look that unfamiliar to when someone has reinforcement learning algorithms
[00:42.280 --> 00:45.160]  that are teaching something to walk.
[00:46.160 --> 00:49.300]  So from an offensive standpoint, I think it's an important distinction
[00:49.300 --> 00:56.000]  because offensive folks try to live in the realm of possibility
[00:57.360 --> 01:04.060]  rather than in a box that we try to define.
[01:04.060 --> 01:07.900]  So, maybe if you start thinking of everything as machine learning,
[01:07.900 --> 01:11.880]  you're going to find more opportunities to attack machine learning models,
[01:11.880 --> 01:15.080]  use them in your ops, whatever that might be.
[01:15.740 --> 01:23.280]  But for this attack in particular, we're going to do a copycat model.
[01:23.280 --> 01:28.060]  And I think this kind of concept will come out as we go through it.
[01:28.840 --> 01:31.940]  So, I'm kind of tired of the name copycat model,
[01:31.940 --> 01:34.740]  so I'm just going to call them Pink Panther attacks from now on.
[01:35.220 --> 01:39.780]  One, it's more fun. Two, it's still kind of cat-like.
[01:40.120 --> 01:45.720]  And three, I know Pink Panther involves a diamond that someone is always trying to steal.
[01:46.260 --> 01:50.020]  And preconditions for a successful attack, and these are pretty loose,
[01:50.020 --> 01:53.060]  are a representative data set.
[01:53.380 --> 01:57.240]  So, for example, in the case of an AMSI provider,
[01:57.240 --> 02:03.900]  we're looking for like PowerShell, VBA code,
[02:04.440 --> 02:07.280]  a big representative data set that we want to model.
[02:07.620 --> 02:11.960]  And we need the ability to get feedback from our target model.
[02:11.960 --> 02:15.760]  And this doesn't have to be direct output.
[02:15.760 --> 02:21.840]  So, if you imagine that a model doesn't give you back a score,
[02:21.840 --> 02:24.580]  but it keeps it in a Windows event log,
[02:24.580 --> 02:29.000]  you still get your hard label, hard label being a 0 or a 1.
[02:29.860 --> 02:33.860]  So, even if the model doesn't give you back output,
[02:33.860 --> 02:41.700]  there are still, just due to the telemetry nature of our networks,
[02:41.700 --> 02:46.900]  it's more likely than not that somewhere output's being recorded.
[02:47.420 --> 02:50.840]  So, rather than maybe going directly to the model,
[02:50.840 --> 02:56.220]  try and think of some binary test that you can perform
[02:57.240 --> 03:02.760]  that is potentially outside of the realm of a machine learning model.
[03:03.540 --> 03:07.600]  So, it's a pretty simple attack once you kind of get into it.
[03:07.600 --> 03:10.960]  Effectively, you grab a massive...
[03:10.960 --> 03:15.300]  in this case, we found a big PowerShell corpus.
[03:15.640 --> 03:19.740]  So, there's like 400,000 scripts.
[03:20.020 --> 03:24.760]  And the workshop, the lab piece of it, we'll talk about that.
[03:25.020 --> 03:30.800]  Effectively, the scripts get fed into an AMSI integration.
[03:32.080 --> 03:34.980]  And this just came off of the Windows sample.
[03:34.980 --> 03:39.700]  So, all it does is load amsi.dll and instantiate the COM object
[03:40.400 --> 03:46.600]  and we feed it every script in that corpus.
[03:47.000 --> 03:52.940]  And that gets passed to Defender or whatever AMSI provider that might be.
[03:53.280 --> 03:56.140]  And it comes back with a score.
[03:57.120 --> 04:01.460]  And then we turn around and we just collect those scores offline.
[04:02.400 --> 04:08.420]  And once we have known inputs and we know the output,
[04:08.420 --> 04:13.740]  what Defender thinks of each input, we can model the model.
[04:14.560 --> 04:22.240]  And the hardest part about, I think, adversarial machine learning
[04:22.240 --> 04:24.280]  isn't the machine learning piece.
[04:24.720 --> 04:27.940]  Most of us don't have to invent math.
[04:27.940 --> 04:34.620]  People much smarter than myself have already done that for me.
[04:34.620 --> 04:37.840]  But what is difficult is the engineering piece.
[04:37.840 --> 04:40.420]  It's getting access to the right data.
[04:40.720 --> 04:46.280]  It's getting data in a way that isn't going to get you caught,
[04:46.280 --> 04:51.300]  so you're going to distribute your traffic accordingly.
[04:51.760 --> 04:54.700]  And you can kind of fly under the radar.
[04:54.700 --> 04:57.800]  Currently, attackers kind of have free rein
[04:58.440 --> 05:02.760]  and get away with quite a lot of querying and noise in the network.
[05:02.760 --> 05:04.680]  In the future, that's not always going to be the case.
[05:05.120 --> 05:11.380]  And I think it is helpful for us to start thinking about limiting our queries,
[05:11.380 --> 05:13.900]  thinking about information density of a command,
[05:14.760 --> 05:19.140]  a ping versus a process list, for example.
[05:20.880 --> 05:24.260]  With a ping, you're going to get whether a host is up or not.
[05:24.260 --> 05:26.780]  Or you're going to get the TTL, which is going to hint at the OS.
[05:27.200 --> 05:30.360]  But a process list, you're going to get that and more.
[05:30.920 --> 05:33.040]  But it probably just depends on what you're after.
[05:34.840 --> 05:39.500]  And from a machine learning perspective and an offensive perspective,
[05:39.500 --> 05:41.100]  anything can be modeled.
[05:41.100 --> 05:49.120]  So if defenders are using machine learning to model whatever logs they're looking for,
[05:49.120 --> 05:54.520]  there's nothing to say that we as offensive folks can't also use machine learning
[05:54.520 --> 05:57.320]  to, say, model C2 traffic.
[05:57.320 --> 06:01.600]  Or let's say you have Symantec, you have some product that is blocking you
[06:01.600 --> 06:05.260]  and you can't quite pinpoint it.
[06:05.880 --> 06:08.200]  Machine learning will be able to do that for you
[06:08.200 --> 06:11.240]  if you can set up the right experiment.
[06:11.300 --> 06:18.780]  So even though Symantec is not necessarily using machine learning to look for your C2 traffic,
[06:18.780 --> 06:25.360]  you can still use it to find those relationships between whatever domain
[06:25.360 --> 06:34.820]  or whatever callback interval that you're using against whatever product it might be.
[06:35.500 --> 06:39.340]  So I guess I just come back to the question of what isn't machine learning.
[06:41.840 --> 06:48.280]  And in my mind, everything is machine learning if you kind of go back to the math,
[06:48.280 --> 06:50.580]  which is kind of fun.
[06:50.720 --> 06:58.240]  Okay, so in the lab, the workshop, all the code is in there for you.
[06:58.240 --> 07:06.280]  So the first half, you'll step through and it'll be very explanatory.
[07:07.380 --> 07:12.840]  And then as we go through, it'll be less and less explanation.
[07:12.840 --> 07:18.800]  So you'll be required to kind of look more, ask more questions, etc., etc.
[07:19.380 --> 07:23.360]  Break stuff, like you're not going to break it, you can always reset it.
[07:23.360 --> 07:25.720]  Google it, ask us for help.
[07:25.720 --> 07:34.640]  AI Village has easily, probably the highest concentration of some of the smartest people in the industry.
[07:34.820 --> 07:41.600]  And there's so many PhDs and super experienced people.
[07:41.600 --> 07:46.980]  So, you know, we love talking. I like talking offensive ops, but you know,
[07:47.580 --> 07:55.000]  a lot of people like talking stats and topology and homology and all this other stuff.
[07:55.600 --> 07:59.240]  The other day I learned that there is more than one algebra.
[08:01.760 --> 08:06.000]  But yeah, so this is the place to ask, so go ask for it.
[08:06.640 --> 08:13.380]  As you go through, I would challenge those of you who are more experienced to produce a better and more efficient attack.
[08:14.320 --> 08:19.800]  And in there, there is also a massive data set of VBA code that I found.
[08:19.800 --> 08:21.980]  It's like nine gigs of VBA code.
[08:22.640 --> 08:27.580]  So if you want, go through this code, it would work the same.
[08:27.580 --> 08:35.760]  And you could get your name on that leaderboard for CVEs for machine learning systems.
[08:37.280 --> 08:40.400]  So if you want to do that, you can go for it.
[08:42.280 --> 08:46.680]  So feel free to ask questions in Discord. There are no dumb questions.
[08:47.620 --> 08:52.440]  And we have three TAs that are helping me out here.
[08:52.500 --> 08:56.100]  So if you have specific questions, you can ask them.
[08:56.260 --> 09:00.280]  So hopefully you have the link.
[09:01.000 --> 09:02.540]  You have everything.
[09:03.980 --> 09:09.100]  So effectively, this is the... what do you call this?
[09:09.100 --> 09:11.080]  This is the workshop.
[09:13.180 --> 09:16.440]  And there are two files, two Python notebooks.
[09:16.440 --> 09:21.100]  One is the solutions, and one is the workbook.
[09:21.960 --> 09:23.740]  And you should start with the workbook.
[09:23.740 --> 09:32.940]  If you do get stuck, or you need code help, or you just want to grab it and go, the solutions one is going to work for you.
[09:32.940 --> 09:45.220]  This amsi.h file is for if you do want to compile AmsiStream.exe so you can recreate the full attack path.
[09:46.000 --> 09:48.380]  You're going to need this header file.
[09:48.900 --> 09:53.640]  collect.py is what we use to gather all the information.
[09:53.640 --> 09:57.580]  And then this insights.xlsx, this Excel file.
[09:57.580 --> 10:03.860]  This is the version 1 of the insights that we pulled out of Defender.
[10:04.520 --> 10:09.840]  So this would be like that Proofpoint-style insight.
[10:10.580 --> 10:15.760]  And it is our version 1, so this is it.
[10:15.760 --> 10:19.400]  So obviously you can see there's a lot of binary blobs in here.
[10:19.480 --> 10:22.220]  But these would be the most malicious tokens.
[10:22.860 --> 10:25.040]  And this makes sense to me.
[10:34.900 --> 10:43.180]  And at the bottom you have the least malicious tokens.
[10:46.080 --> 10:48.240]  So that's free.
[10:50.560 --> 10:54.260]  So you guys can get started.
[10:54.260 --> 10:56.400]  Do you guys have any questions right away?
[10:59.220 --> 11:01.640]  I haven't seen any questions so far.
[11:03.840 --> 11:04.880]  Perfect.
[11:10.490 --> 11:11.790]  That's good.
[11:15.890 --> 11:18.150]  Is anyone else getting 403?
[11:32.220 --> 11:34.300]  I have to say I haven't used Defender.
[11:35.440 --> 11:37.660]  Is anyone else getting 403?
[11:45.650 --> 11:49.630]  We are not seeing any chat on Twitch or Discord.
[11:49.630 --> 11:54.550]  I just reopened the link and it's building the container for me.
[11:54.910 --> 11:56.030]  Yeah, me too.
[11:56.030 --> 11:57.310]  Okay, that's good.
[11:59.070 --> 12:01.030]  I have my own copy.
[12:01.050 --> 12:03.770]  I am in the Jupyter Notebook.
[12:03.850 --> 12:04.770]  Nice.
[12:05.110 --> 12:12.630]  Yeah, so the first little bit, I'll give you guys 10-15 minutes to rip through the first little bit.
[12:12.630 --> 12:18.850]  And if you guys have questions, let me know.
[12:21.570 --> 12:23.510]  Remote is always a bit weird.
[12:33.570 --> 12:36.610]  But, you know what always gets me in the mood?
[12:41.760 --> 12:42.740]  Can you guys hear that?
[12:42.740 --> 12:44.680]  What always gets you in the mood?
[12:45.540 --> 12:46.380]  Tron.
[12:47.140 --> 12:50.980]  Yeah, I don't think it's a good idea to stream copyrighted music on here.
[12:50.980 --> 12:52.900]  Oh, right, right, right.
[12:54.100 --> 12:59.860]  Twitch doesn't really have a problem with that, but the YouTube VOD at the end will be removed.
[12:59.980 --> 13:00.920]  That's fair.
[13:01.960 --> 13:04.600]  Yeah, I don't spend that much time on the internet.
[13:06.380 --> 13:10.220]  So, Will, I can send you the Notebook link again.
[13:10.220 --> 13:13.660]  The one that myself and other people are using on the chat.
[13:13.820 --> 13:16.240]  That's good. I have my own.
[13:42.000 --> 13:42.740]  Excellent.
[13:42.740 --> 13:46.720]  Okay, so I'm just going to talk.
[13:46.840 --> 13:55.260]  Everything is pretty self-explanatory in terms of the instructions.
[13:57.820 --> 14:01.420]  Can you zoom in a little bit? It's kind of hard to see on the screen.
[14:05.910 --> 14:07.070]  Is that better?
[14:09.610 --> 14:10.830]  One more.
[14:14.780 --> 14:16.080]  That should be good.
[14:17.840 --> 14:19.800]  Sorry, there's a one-year-old.
[14:20.320 --> 14:22.180]  I can see it very clearly.
[14:23.980 --> 14:28.900]  Excellent. So, I'm just going to talk. You guys feel free to run through.
[14:30.240 --> 14:36.880]  So, sort of the basis of this attack is really the Antimalware Scan Interface.
[14:39.340 --> 14:45.060]  And AMSI is something they introduced in, I want to say, PowerShell 5.
[14:45.940 --> 14:47.720]  A little sooner than that.
[14:47.920 --> 15:02.580]  But effectively, it's just a DLL that gets called into PowerShell, UAC dialogs, any JScript, VBScript, and now .NET 4.8 and beyond.
[15:03.500 --> 15:17.380]  And inside, it has a number of blacklisted functions, and whenever one of those functions is called, AMSI gets called to scan the content.
[15:18.560 --> 15:26.020]  And AMSI is not a security boundary. It's not a security product. It's an interface.
[15:26.020 --> 15:40.980]  And so, all it does is collect whatever content is being put into the buffer or the string, and it passes it on to an AMSI provider.
[15:41.360 --> 15:47.920]  And the AMSI provider then has an opportunity to scan the content and make some determination.
[15:47.920 --> 15:51.820]  So, obviously, this workshop is about Windows Defender.
[15:53.280 --> 16:02.300]  But there are several other providers, AMSI providers, that this attack would work against.
[16:04.180 --> 16:11.460]  And that's, I suppose, kind of the nice part about machine learning currently is there's a lot of attack surface.
[16:11.460 --> 16:21.080]  So, once an AMSI provider scans the content and decides what it is, there's a range of scores it can give.
[16:21.080 --> 16:27.160]  Currently, it only gives back 0, 1, or 32768.
[16:27.540 --> 16:32.060]  But in Microsoft documentation, they talk about a range of scores.
[16:32.060 --> 16:49.080]  So, rather than a hard label of 0 or 1, eventually you'd get some sort of regression-type continuous variable on the output.
[16:49.080 --> 16:52.700]  And so, you'd get some heuristic, I suppose.
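The score values mentioned here line up with the `AMSI_RESULT` constants in the Windows SDK header. The snippet below is a minimal sketch of collapsing a raw score into the hard label used for the copycat; the helper names `is_malware` and `hard_label` are made up for illustration and are not taken from the workshop notebook.

```python
# AMSI_RESULT constants as defined in the Windows SDK's amsi.h.
AMSI_RESULT_CLEAN = 0
AMSI_RESULT_NOT_DETECTED = 1
AMSI_RESULT_BLOCKED_BY_ADMIN_START = 0x4000
AMSI_RESULT_BLOCKED_BY_ADMIN_END = 0x4FFF
AMSI_RESULT_DETECTED = 32768


def is_malware(result: int) -> bool:
    """Same comparison as the AmsiResultIsMalware macro: at or above DETECTED."""
    return result >= AMSI_RESULT_DETECTED


def hard_label(result: int) -> int:
    """Hypothetical helper: collapse a raw score into a 0/1 training label."""
    return 1 if is_malware(result) else 0


for raw in (0, 1, 32768):
    print(raw, hard_label(raw))
```

If a provider ever started returning intermediate values, the same raw score could be kept as a regression target instead of being collapsed.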
[16:56.520 --> 17:05.000]  There's a great talk by, actually Microsoft, called Badly Behaving Scripts.
[17:05.000 --> 17:08.540]  It's probably like an hour long.
[17:08.600 --> 17:13.520]  But I think they gave it two years ago at BlueHat IL.
[17:13.920 --> 17:17.760]  And it's two engineers from the machine learning team.
[17:17.760 --> 17:33.140]  And they're just discussing how AMSI works, the things that they're looking for, the kind of models that they're building, and so on.
[17:33.140 --> 17:40.640]  And so, this is kind of our first indication, at least officially, that machine learning is being pushed onto the client.
[17:40.640 --> 17:51.460]  And I think it would seem anyway that overtly malicious things get stopped on the client.
[17:51.460 --> 17:58.860]  But anything that's kind of in between seems to go to the cloud.
[17:58.860 --> 18:03.540]  One way to test this would be to have some sort of timing attack.
[18:03.540 --> 18:10.840]  You could measure the response time from submission to receipt of the score.
[18:11.140 --> 18:22.600]  And you could maybe see if there's a significant difference between the worst scripts and the medium bad scripts.
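A rough sketch of that timing comparison, assuming you already have some callable that submits one script and returns a score. Here `fake_scan` is a hypothetical stand-in for that submission, so the numbers it produces mean nothing; only the measurement pattern is the point.

```python
import statistics
import time
from typing import Callable, Iterable, List


def time_scans(scan: Callable[[str], int], scripts: Iterable[str]) -> List[float]:
    """Return the wall-clock seconds each call to scan() took."""
    timings = []
    for script in scripts:
        start = time.perf_counter()
        scan(script)
        timings.append(time.perf_counter() - start)
    return timings


def median_gap(group_a: List[float], group_b: List[float]) -> float:
    """Median latency of group_b minus median latency of group_a, in seconds."""
    return statistics.median(group_b) - statistics.median(group_a)


def fake_scan(script: str) -> int:
    # Hypothetical stand-in for a real AMSI submission; always "not detected".
    return 1


overt = time_scans(fake_scan, ["script-a", "script-b", "script-c"])
borderline = time_scans(fake_scan, ["script-d", "script-e", "script-f"])
print(f"median gap: {median_gap(overt, borderline):.6f}s")
```

A consistently larger median for the in-between scripts would be a hint, not proof, that those samples take a longer path such as a cloud lookup.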
[18:25.310 --> 18:28.330]  So, moving on. You guys can obviously read.
[18:29.910 --> 18:37.370]  So, the first thing we obviously needed to do to create a copycat is we needed a data set.
[18:37.370 --> 18:47.130]  And there's a talk by Lee Holmes and Daniel Bohannon in 2017 called Revoke-Obfuscation: PowerShell Obfuscation Detection Using Science.
[18:48.210 --> 18:59.370]  So, Daniel Bohannon wrote Invoke-Obfuscation, which basically takes a script or any number of different languages
[18:59.790 --> 19:03.550]  and puts it through obfuscation.
[19:03.950 --> 19:08.390]  So, it breaks it up into a million different pieces.
[19:08.390 --> 19:12.990]  So, it'll break those regexes, those brittle detections.
[19:12.990 --> 19:15.470]  It's an awesome, awesome tool.
[19:15.470 --> 19:17.690]  These guys wrote it, wrote the talk.
[19:17.690 --> 19:23.090]  And as part of that research, they collected like two gigs worth of PowerShell scripts.
[19:23.550 --> 19:29.570]  And they labeled them both benign and malicious for us.
[19:30.310 --> 19:33.470]  And so, they're already... you can actually... the link's here.
[19:33.470 --> 19:36.130]  So, you can go pull it down and rip through it.
[19:37.790 --> 19:46.070]  Once you have that, the rest of it is just getting target outputs from the target models.
[19:46.110 --> 19:54.810]  So, if we want to create a copycat, we want to know what Defender thinks of each script that we give it.
[19:55.090 --> 19:58.110]  And then, that is remodeling the model.
[19:58.110 --> 20:00.030]  So, we have the inputs and we have the outputs.
[20:00.030 --> 20:01.770]  We don't know what's in the middle.
[20:01.830 --> 20:03.210]  We don't know that black box.
[20:03.210 --> 20:08.590]  But we can infer it with our own model.
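To make the inputs-and-outputs idea concrete, here is a deliberately tiny copycat sketch in pure Python. It is not the model used in the workshop: it assumes whitespace tokens and a per-token log-odds weight with add-one smoothing, and the four labeled samples are invented toy data rather than real Defender verdicts.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def train_copycat(samples: List[Tuple[str, int]]) -> Dict[str, float]:
    """Learn one weight per token from (script_text, hard_label) pairs.

    Each weight is the log-odds of a token appearing in a flagged script
    versus a clean one, with add-one smoothing.
    """
    flagged, clean = Counter(), Counter()
    n_flagged = n_clean = 0
    for text, label in samples:
        tokens = set(text.split())
        if label == 1:
            flagged.update(tokens)
            n_flagged += 1
        else:
            clean.update(tokens)
            n_clean += 1
    weights = {}
    for token in set(flagged) | set(clean):
        p_flagged = (flagged[token] + 1) / (n_flagged + 2)
        p_clean = (clean[token] + 1) / (n_clean + 2)
        weights[token] = math.log(p_flagged / p_clean)
    return weights


def predict(weights: Dict[str, float], text: str) -> int:
    """Return 1 if the summed token weights lean towards 'flagged', else 0."""
    score = sum(weights.get(tok, 0.0) for tok in set(text.split()))
    return 1 if score > 0 else 0


# Toy (input, target output) pairs; real labels would come from the target.
samples = [
    ("Invoke-Thing MemoryStream ToBase64String", 1),
    ("MemoryStream ToBase64String Add-Type", 1),
    ("Get-ChildItem Write-Output", 0),
    ("Get-Date Write-Output", 0),
]
weights = train_copycat(samples)
print(predict(weights, "MemoryStream Write-Host"))
print(predict(weights, "Write-Output Get-Date"))
```

The learned weights double as a crude ranking of which tokens the target seems to dislike, which is the kind of insight the attack is after.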
[20:08.630 --> 20:11.530]  And you'll never get 100% of the model.
[20:11.530 --> 20:24.790]  But you might get just enough to bypass Windows Defender for a month or just one time or whatever it might be.
[20:24.790 --> 20:37.290]  But the nice part about machine learning and the struggle that machine learning has, I think, is it does introduce a probability into what used to be a static decision.
[20:40.850 --> 20:45.050]  The other large repository of something interesting is VBA.
[20:45.050 --> 20:50.730]  So, that's just 9 gigs of Excel macros that you could do the same with.
[20:52.850 --> 20:54.610]  Which would be pretty awesome.
[20:58.830 --> 21:02.750]  We have... let's see, what is this?
[21:02.750 --> 21:07.370]  This is the... you're not on Windows.
[21:08.250 --> 21:13.150]  If this were like Lab Lab, we'd have Windows VMs and we'd do this for real.
[21:13.390 --> 21:16.310]  This is just the output from the AMSI stream.
[21:16.430 --> 21:22.270]  So, InvokeWMIBackdoor looks malicious, but Defender doesn't think it's malicious.
[21:22.270 --> 21:32.590]  So, if you look down, you see ScanResult is 1, IsMalware is 0, IsMalware is the official feedback loop.
[21:33.410 --> 21:39.730]  And even in that talk that is above, they discuss the fact that you can't trust filenames.
[21:39.730 --> 21:43.330]  And that makes sense. You can't trust headers anymore.
[21:43.970 --> 21:55.550]  There's some research that was done, like 3 years ago now, against mail filters where you would null out the first 2 bytes of a .m file.
[21:56.150 --> 22:06.430]  And depending on how the mail filter decided what kind of file it was, it would either block it or let it through.
[22:06.430 --> 22:09.550]  Because it can't read the magic bytes at the top.
[22:10.330 --> 22:14.310]  And then, obviously the document is corrupt at that point.
[22:14.310 --> 22:19.930]  But what Windows does is when you open the .m file, it would ask you to repair it.
[22:20.410 --> 22:29.790]  And if the user clicked yes, I'd like to repair this document, your macro would live and you could get code execution.
[22:31.790 --> 22:36.510]  Right, but the next bit is the provider display name.
[22:36.510 --> 22:38.990]  So we have Microsoft Defender Antivirus.
[22:39.850 --> 22:41.790]  If this is...
[22:47.110 --> 22:54.810]  If Defender is turned off for whatever reason, or there is no AMSI provider, it will give you an error.
[22:55.090 --> 22:57.730]  And you will not get a score back.
[22:58.430 --> 23:01.770]  The next piece is just PowerView.
[23:01.770 --> 23:05.450]  So PowerView is a pretty well-known script, and it is definitely...
[23:05.450 --> 23:12.810]  Well, it's not explicitly malicious, but it is used by malicious people.
[23:14.390 --> 23:16.190]  That's not fair to say.
[23:16.250 --> 23:18.250]  It's used by attackers.
[23:18.630 --> 23:21.710]  I'm sure there are some nice attackers.
[23:23.910 --> 23:25.980]  So in this lab, there are...
[23:27.290 --> 23:32.410]  Uploading 380,000 scripts to GitHub was not popular with GitHub.
[23:32.990 --> 23:35.610]  So in this little bit, you only have 3,000.
[23:35.630 --> 23:39.010]  But the collect.py has everything you need.
[23:40.090 --> 23:44.210]  And when we were parsing, we just did...
[23:44.210 --> 23:47.390]  In your data, you have data clean and dirty.
[23:47.890 --> 23:50.070]  These would be your malicious scripts.
[23:50.070 --> 23:56.740]  And clean scripts are actually clean.
[23:56.740 --> 23:58.120]  There's like 3,000.
[23:58.120 --> 24:04.280]  There are about 1,200 malicious scripts that you can look at.
[24:04.740 --> 24:08.640]  Yeah, if you guys want to start ripping through this code.
[24:11.000 --> 24:15.080]  As we were parsing, you can look through collect.py.
[24:15.300 --> 24:18.200]  But I personally like to keep...
[24:18.200 --> 24:21.380]  So there's a lot of moving parts in machine learning.
[24:21.380 --> 24:23.560]  You kind of can't get away from it.
[24:24.060 --> 24:28.560]  So whenever possible, I like to keep the output at each step.
[24:28.560 --> 24:33.060]  I'll build a data structure and I'll keep the previous output.
[24:33.880 --> 24:35.320]  And I think it works nicely.
[24:35.320 --> 24:37.360]  Most of my datasets are pretty small anyway.
[24:37.440 --> 24:39.280]  So it's manageable.
[24:39.620 --> 24:42.820]  But you can kind of see the GetScreenshot.
[24:42.820 --> 24:46.800]  So this is the file name, the original file name.
[24:46.800 --> 24:49.580]  This is the hash, the MD5 hash of it.
[24:49.580 --> 24:53.760]  And then you have the result, the IsMalware value.
[24:54.000 --> 24:59.080]  And then you have the Base64-encoded text that you can look through.
[24:59.080 --> 25:02.720]  I like this because if there's any weirdness,
[25:02.720 --> 25:05.960]  you have everything you need right in front of you.
[25:06.000 --> 25:09.640]  You don't have to start again at a particular point.
[25:09.640 --> 25:12.320]  You don't have to go back to the beginning of the process.
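A minimal sketch of a record shaped like the one described: original file name, MD5 of the content, the verdict, and the Base64-encoded script. The comma delimiter, the `make_record` name, and the sample file are all assumptions for illustration; `collect.py` may lay things out differently.

```python
import base64
import hashlib

DELIMITER = ","  # assumed; Base64 and hex digests never contain a comma


def make_record(filename: str, content: bytes, is_malware: int) -> str:
    """Pack one scanned script into a single delimited line."""
    md5 = hashlib.md5(content).hexdigest()
    encoded = base64.b64encode(content).decode("ascii")
    return DELIMITER.join([filename, md5, str(is_malware), encoded])


# Hypothetical sample; a real run would read the script bytes from disk.
record = make_record("example.ps1", b"Write-Output 'hello'", 0)
print(record)
```

Because the raw content travels with the verdict, any later step can be re-run or debugged from the record alone.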
[25:13.500 --> 25:17.580]  Being able to debug all the way through your pipeline is...
[25:17.580 --> 25:20.500]  It might cost you some speed.
[25:22.280 --> 25:25.240]  But at least for me, it's fine.
[25:25.240 --> 25:27.460]  I'm not dealing with billions of anything.
[25:31.600 --> 25:34.140]  We're just going to discuss lists.
[25:34.340 --> 25:37.180]  So if you're not familiar with a list,
[25:37.180 --> 25:41.280]  which I think probably most of you are,
[25:41.860 --> 25:44.500]  raise your hand if you're familiar with lists.
[25:46.000 --> 25:48.420]  Nice, okay, so about 20% of you.
[25:52.420 --> 25:54.040]  Can't hear any giggling.
[25:54.840 --> 25:57.600]  Alright, so you have lists. Lists are really nice.
[25:57.600 --> 26:00.160]  So you can just split them on any delimiter you want.
[26:02.280 --> 26:04.860]  So quick question, Will.
[26:05.020 --> 26:09.380]  This is from the Twitch chat.
[26:10.200 --> 26:14.020]  Should they be able to view the PowerShell code?
[26:14.020 --> 26:17.320]  When they click on a file in clean or dirty,
[26:17.320 --> 26:19.620]  they get a mess of characters but not a script.
[26:21.160 --> 26:23.760]  Yes, so you...
[26:23.760 --> 26:27.340]  Yeah, so in this little code block here,
[26:28.000 --> 26:29.200]  you can open it.
[26:29.200 --> 26:31.380]  And it is just a...
[26:31.380 --> 26:33.620]  So we've already parsed it for you.
[26:33.720 --> 26:35.960]  It was just going to take forever otherwise.
[26:36.320 --> 26:40.460]  So what you're going to see is the script name.
[26:41.080 --> 26:43.680]  So this is the original script name.
[26:43.680 --> 26:47.580]  The MD5 hash of the script content.
[26:48.550 --> 26:52.030]  The AMSI result, or I should say the Defender result.
[26:52.300 --> 26:54.760]  So what Defender thinks of the script.
[26:54.980 --> 26:59.900]  And then the Base64 encoded version of the script.
[27:03.900 --> 27:07.680]  Can you turn lines on?
[27:15.820 --> 27:17.940]  So this is what we're doing here.
[27:17.940 --> 27:20.380]  So we're splitting a file.
[27:20.380 --> 27:22.380]  So what we're doing is we're setting up...
[27:22.380 --> 27:25.900]  Eventually we're going to rip through all these files
[27:25.900 --> 27:28.340]  and we're going to build a big vocab.
[27:28.700 --> 27:30.360]  So this is a...
[27:30.360 --> 27:33.060]  I just pulled one out as an example.
[27:34.840 --> 27:35.920]  So let's see.
[27:36.820 --> 27:39.000]  So for example, if we want to reference...
[27:39.000 --> 27:40.740]  This is just list stuff.
[27:40.740 --> 27:42.080]  So obviously zero-indexed.
[27:42.080 --> 27:45.140]  So we can reference it however we want.
[27:46.480 --> 27:49.260]  You can then reference with a colon.
[27:49.300 --> 27:52.600]  You know, from index one to the end.
[27:52.600 --> 27:56.480]  Or from index one to index three.
[27:58.740 --> 28:01.260]  Or you can even do minus.
[28:01.260 --> 28:03.720]  So you can go to the end of the list.
[28:04.360 --> 28:06.740]  And just get the script content.
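The list operations just described, sketched on a made-up record. The comma delimiter, the file name, and the placeholder hash are assumptions for illustration only.

```python
# Hypothetical record: name, MD5 placeholder, verdict, Base64 content.
record = "example.ps1,0123456789abcdef0123456789abcdef,1,V3JpdGUtT3V0cHV0"
fields = record.split(",")

first = fields[0]      # zero-indexed: the original file name
rest = fields[1:]      # from index one to the end
middle = fields[1:3]   # index one up to, but not including, index three
last = fields[-1]      # negative index: last element, the script content

print(first, middle, last)
```

Note that the end of a slice is exclusive, so `fields[1:3]` returns two items, not three.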
[28:07.560 --> 28:10.740]  But if we're going to look at the script content,
[28:10.740 --> 28:13.740]  we kind of want to know...
[28:15.580 --> 28:17.880]  It's interesting that getScreenshot
[28:19.000 --> 28:21.140]  is counted as malicious.
[28:22.120 --> 28:25.120]  So if we decode it,
[28:25.120 --> 28:26.820]  we can have a look and maybe...
[28:26.820 --> 28:28.500]  Maybe there's some knowledge that we have
[28:28.500 --> 28:33.800]  that we can think about as to why it might be malicious.
[28:37.780 --> 28:41.160]  So we're just using split to make it a little nicer.
[28:41.160 --> 28:42.740]  And there's a typo right there.
[28:44.980 --> 28:46.580]  It's a double typo now.
[28:48.780 --> 28:54.130]  Alright, so we're just going to run through it.
[28:54.190 --> 28:55.230]  Actually, no, it's not.
[28:57.770 --> 28:59.230]  So to make it nice, actually,
[28:59.230 --> 29:00.730]  this isn't the solutions one,
[29:00.730 --> 29:02.110]  so you guys already know this.
[29:04.330 --> 29:06.430]  So if you want to make it a little nicer,
[29:06.430 --> 29:08.310]  we're just going to split it again.
[29:09.670 --> 29:14.930]  And we're going to get a list of the things.
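A small sketch of the decode-then-split step. The two-line sample script is invented, and UTF-8 is an assumed encoding; the real corpus may need a different codec.

```python
import base64

# Invented two-line script, encoded here only so there is something to decode.
sample = b"$ms = New-Object System.IO.MemoryStream\n[Convert]::ToBase64String($ms.ToArray())"
encoded = base64.b64encode(sample).decode("ascii")

# Decode the Base64 field back to text (UTF-8 is an assumption).
text = base64.b64decode(encoded).decode("utf-8", errors="replace")

# First split: into lines. Second split: each line into whitespace tokens.
lines = text.split("\n")
tokens = [tok for line in lines for tok in line.split()]

print(tokens)
```

Whitespace splitting keeps punctuation attached to words, which is why so many odd-looking tokens show up later.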
[29:14.930 --> 29:16.610]  So, looking through this list,
[29:16.610 --> 29:19.650]  in my mind, some of the more malicious things
[29:19.650 --> 29:21.770]  would be the Add-Type -Assembly.
[29:22.590 --> 29:26.530]  It would be [Convert]::ToInt32,
[29:28.470 --> 29:29.110]  New-Object System.IO.MemoryStream,
[29:29.110 --> 29:31.450]  so everything like in-memory attacks,
[29:31.450 --> 29:32.950]  any sort of memory streams.
[29:32.950 --> 29:35.270]  And then at the bottom, you also have
[29:35.270 --> 29:37.790]  ToBase64String.
[29:39.370 --> 29:41.730]  I think Base64 is typically something
[29:41.730 --> 29:44.370]  that gets picked up quite easily.
[29:46.430 --> 29:47.750]  But if we wanted to figure out
[29:47.750 --> 29:50.690]  how this was...
[29:50.690 --> 29:52.570]  why this was malicious,
[29:52.570 --> 29:56.270]  or why Defender thought this was malicious,
[29:56.270 --> 29:57.750]  does anybody have any ideas
[29:58.850 --> 30:00.850]  that they want to type out?
[30:00.850 --> 30:13.620]  Is anybody typing? I can't see.
[30:19.720 --> 30:21.480]  Yeah, it doesn't seem like there is
[30:21.960 --> 30:24.020]  anybody typing on Discord right now.
[30:24.020 --> 30:24.740]  Nice.
[30:27.420 --> 30:29.840]  Yeah, so we're just going to go through the dictionary.
[30:30.560 --> 30:33.660]  So, this is just what I came up with,
[30:33.660 --> 30:35.500]  and it would be to try and determine
[30:35.500 --> 30:39.500]  the commonality between all the malicious scripts
[30:39.500 --> 30:43.500]  and see if... see what tokens...
[30:44.760 --> 30:47.180]  and when I say tokens, I mean words...
[30:47.720 --> 30:49.640]  words came to the top.
[30:50.100 --> 30:51.640]  And to do this, we're going to use...
[30:51.640 --> 30:53.040]  we're just going to do a dictionary.
[30:53.840 --> 30:55.440]  And it's kept as, you know,
[30:55.440 --> 30:57.960]  key-value pair.
[30:58.600 --> 31:00.720]  So I'll let you guys run through that.
[31:11.360 --> 31:13.780]  And if you need to play around with dictionaries,
[31:13.780 --> 31:15.440]  they're pretty awesome.
[31:15.440 --> 31:18.860]  I like dictionaries,
[31:19.440 --> 31:21.220]  but there is one other data structure
[31:21.220 --> 31:25.240]  that I like even more than dictionaries.
[31:26.660 --> 31:27.640]  Excellent.
[31:27.640 --> 31:29.700]  So, we know what dictionaries are now.
[31:29.700 --> 31:32.120]  And we're just going to decode the content,
[31:32.120 --> 31:35.660]  and we're going to add it all to a dictionary.
[31:36.180 --> 31:38.700]  We're only going to do one script.
[31:39.100 --> 31:41.660]  And does anybody know why we might
[31:41.660 --> 31:44.920]  just only want to do one for now?
[31:53.660 --> 31:56.460]  Because it's good practice?
[31:56.460 --> 31:57.800]  Yeah, it's good practice.
[31:57.800 --> 32:01.120]  But when you are dealing with, I don't know,
[32:01.120 --> 32:02.980]  thousands of scripts,
[32:02.980 --> 32:05.320]  like half a million scripts,
[32:05.320 --> 32:08.680]  if you have something that's going to mess up
[32:08.680 --> 32:12.960]  halfway through, or you start down that path,
[32:15.140 --> 32:18.280]  you're just going to waste a lot of time fixing errors.
[32:18.280 --> 32:20.400]  So, if you can get it to work with one,
[32:20.400 --> 32:24.640]  and then ten, and then fifty, and then a thousand,
[32:24.640 --> 32:27.220]  I think that's a much better way to go about
[32:29.960 --> 32:34.300]  processing large data, large data sets.
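One way to sketch the one, then ten, then fifty, then a thousand habit. The `run_in_stages` helper and the `picky` processor are hypothetical; the idea is only that a bug surfaces on a small batch before the full corpus is touched.

```python
from typing import Callable, List, Sequence


def run_in_stages(items: Sequence[str],
                  process: Callable[[str], object],
                  stages: Sequence[int] = (1, 10, 50, 1000)) -> List[int]:
    """Run process() on growing prefixes of items, stopping at the first failure.

    Returns the batch sizes that completed without an exception.
    """
    completed = []
    for size in stages:
        batch = items[:size]
        try:
            for item in batch:
                process(item)
        except Exception as exc:
            print(f"stage {size} failed: {exc!r}")
            break
        completed.append(len(batch))
        if size >= len(items):
            break  # everything has been processed; larger stages add nothing
    return completed


def picky(item: str) -> int:
    # Hypothetical processor that chokes on one particular input.
    if item == "script-20":
        raise ValueError("cannot parse script-20")
    return len(item)


scripts = [f"script-{i}" for i in range(60)]
print(run_in_stages(scripts, len))
print(run_in_stages(scripts, picky))
```

Here the failure appears at the 50-item stage, after only a few dozen items of work, instead of somewhere deep into half a million scripts.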
[32:34.300 --> 32:36.140]  So, for this one we're just going to do one,
[32:36.140 --> 32:39.120]  but we're going to split it twice,
[32:39.120 --> 32:42.300]  and then we're going to just add everything
[32:42.300 --> 32:45.520]  to the word index.
[32:45.520 --> 32:49.560]  So, there's a fill-in-the-blank bit here.
[32:50.120 --> 32:52.020]  So, if you guys want to go ahead and run that code,
[32:52.020 --> 32:55.000]  and then tell me what the issue is.
[33:02.290 --> 33:03.430]  And I'll give you clues.
[33:03.430 --> 33:05.630]  It's on the line with all the question marks.
[33:07.230 --> 33:09.670]  I didn't know that I needed to be this close to the microphone.
[33:21.940 --> 33:24.960]  Or you guys are already way, way, way past this.
[33:40.850 --> 33:44.090]  So, to get a proper count of the words
[33:44.090 --> 33:46.590]  as we go through them, we need to just add one.
[33:47.610 --> 33:49.470]  And you would have been able to tell
[33:50.830 --> 33:52.610]  after you went through it,
[33:52.610 --> 33:59.950]  because there's a count of only one listed for each token.
[34:02.940 --> 34:07.400]  But the script has more than one of at least some of them.
[34:07.400 --> 34:09.740]  So, if we just fix that,
[34:14.110 --> 34:16.510]  now we get a better representation
[34:16.510 --> 34:18.550]  of what's out there.
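A minimal sketch of the counting fix, assuming the "split it twice" step means splitting on lines and then on whitespace. The sample script and variable names are made up for illustration; the notebook's own code may differ.

```python
script = "$a = New-Object Foo\n$b = New-Object Bar"  # hypothetical sample

word_index = {}
for line in script.splitlines():      # first split: lines
    for token in line.split():        # second split: whitespace
        # Adding one each time is the fix. Assigning a constant 1
        # would leave every token with a count of exactly one.
        word_index[token] = word_index.get(token, 0) + 1
```

If every count comes back as one on a script that clearly repeats tokens, the increment is the first place to look.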
[34:20.030 --> 34:22.770]  Going one at a time, I guess I've found
[34:22.770 --> 34:25.870]  I used to have a real issue with loops,
[34:25.870 --> 34:30.750]  where I wouldn't break them up properly.
[34:31.670 --> 34:34.610]  And so, doing it a little more slowly,
[34:34.610 --> 34:36.630]  at least one at a time, has helped me.
[34:37.470 --> 34:39.710]  And then we're just going to sort the dictionary
[34:39.710 --> 34:41.610]  by token.
[34:42.010 --> 34:46.010]  So now, this equals sign,
[34:46.010 --> 34:48.070]  there's nine of them,
[34:48.070 --> 34:51.130]  there's six curly brackets,
[34:51.130 --> 34:53.350]  there's four new objects,
[34:55.170 --> 34:59.690]  but nothing particularly malicious.
[35:00.110 --> 35:04.950]  So there is a lot of punctuation in there,
[35:06.210 --> 35:08.970]  but we can deal with that later.
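Sorting the counts could look roughly like this. The counts for `=`, `{` and `New-Object` are the ones read out above; the `$a` entry is a hypothetical filler, and sorting by count (highest first) is an assumption about what "sort by token" means here.

```python
word_index = {"New-Object": 4, "=": 9, "$a": 1, "{": 6}

# Sort (token, count) pairs by count, most frequent first.
sorted_tokens = sorted(word_index.items(), key=lambda item: item[1], reverse=True)
```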
[35:08.970 --> 35:12.010]  So, now, because we're impatient,
[35:12.010 --> 35:13.950]  and we just want to get to the machine learning bit,
[35:13.950 --> 35:18.090]  we're going to run through the entire...
[35:19.090 --> 35:21.630]  all of the malicious scripts.
[35:26.890 --> 35:29.670]  And it might take a little bit on Binder.
[35:37.010 --> 35:38.890]  Might even take a little bit here.
[35:40.930 --> 35:42.250]  The... yeah.
[35:42.630 --> 35:44.630]  I wish I... I tried to ship
[35:45.150 --> 35:48.130]  all of the scripts to you.
[35:53.180 --> 35:55.700]  Okay, so now we're just going to sort the words.
[35:55.700 --> 35:57.780]  So this is all of the tokens
[35:57.780 --> 36:00.900]  for all of the malicious scripts.
[36:05.000 --> 36:07.040]  And it isn't ideal.
[36:13.510 --> 36:15.910]  So there's a lot of numbers,
[36:15.910 --> 36:17.190]  and this is fairly common
[36:17.190 --> 36:19.310]  when you're tokenizing text.
[36:20.070 --> 36:23.050]  And I would generally just say go a little slow.
[36:23.050 --> 36:25.710]  But the nice bit is, we're seeing GetProcAddress.
[36:26.370 --> 36:29.070]  Does anybody know...
[36:29.690 --> 36:31.570]  you know, in what operation
[36:31.570 --> 36:33.690]  GetProcAddress might be used?
[36:38.070 --> 36:40.490]  Does anybody that works in an AV vendor
[36:40.490 --> 36:43.190]  know what GetProcAddress might be used for?
[36:53.010 --> 36:55.050]  You can Google it if you want.
[37:03.250 --> 37:04.710]  No? Nobody?
[37:05.230 --> 37:06.150]  GT?
[37:07.270 --> 37:10.150]  Any penetration testers?
[37:10.950 --> 37:13.450]  Malware authors? Anybody?
[37:15.640 --> 37:17.460]  I'm a data scientist.
[37:17.700 --> 37:19.680]  That's fair.
[37:20.120 --> 37:22.300]  I'll let you look that up.
[37:22.380 --> 37:26.120]  But it's used typically when you're looking up functions
[37:26.120 --> 37:30.080]  in other DLLs.
[37:31.320 --> 37:32.640]  Most notably
[37:32.640 --> 37:35.460]  I think used in
[37:37.800 --> 37:39.600]  process injection.
[37:40.280 --> 37:41.480]  Anyway.
[37:43.760 --> 37:45.620]  Alright, so there's a lot of numbers.
[37:45.620 --> 37:47.020]  It's fairly classic.
[37:47.780 --> 37:50.360]  But we're seeing write bytes to memory,
[37:50.360 --> 37:53.960]  memory address, a lot of malware type things.
[37:55.140 --> 37:57.160]  Remote proc handle.
[37:57.700 --> 37:59.820]  Obviously this is all PowerShell.
[38:00.480 --> 38:02.540]  But the numbers are actually kind of annoying.
[38:03.960 --> 38:04.760]  So...
[38:06.920 --> 38:09.620]  Let's scroll through.
[38:09.620 --> 38:12.160]  Are there any tokens in there
[38:12.160 --> 38:15.780]  that are particularly interesting to anybody?
[38:29.840 --> 38:34.040]  Rob, Logistic Aggression managed to answer your question.
[38:34.040 --> 38:35.780]  Thanks, Rob.
[38:35.780 --> 38:37.260]  Rob's the man.
[38:37.260 --> 38:41.020]  What was it?
[38:41.580 --> 38:45.400]  He said you get the address of a DLL function in memory.
[38:45.400 --> 38:47.660]  Also heavily used in packers.
[38:48.360 --> 38:49.900]  Yeah, exactly.
[38:52.120 --> 38:54.600]  Okay, so the next bit.
[38:54.740 --> 38:57.920]  We're just going to continue to filter down.
[38:58.320 --> 39:00.560]  And I would say that's generally the case
[39:00.560 --> 39:03.060]  with most of your data.
[39:03.060 --> 39:05.500]  I think most of your time spent
[39:05.840 --> 39:07.200]  when you're doing machine learning
[39:07.200 --> 39:13.020]  isn't the math piece that you typically think of.
[39:13.020 --> 39:15.280]  It's processing data,
[39:15.280 --> 39:20.180]  it's filtering data,
[39:20.180 --> 39:21.780]  it's making sure that
[39:21.780 --> 39:24.240]  your data sets are balanced
[39:24.240 --> 39:27.400]  or have a distribution that you want
[39:27.400 --> 39:30.100]  or whatever it might be.
[39:30.100 --> 39:33.860]  I said earlier that we're not inventing new math.
[39:33.860 --> 39:37.640]  I'm not going to be the guy to invent new math.
[39:39.920 --> 39:42.520]  The place that I can be most effective
[39:43.100 --> 39:45.720]  is applying my domain knowledge
[39:45.720 --> 39:48.200]  to what I know of machine learning
[39:48.200 --> 39:52.260]  and being extra careful with my data.
[39:53.300 --> 39:55.000]  Abraham Lincoln has a saying
[39:55.000 --> 39:56.720]  about sharpening an ax.
[39:56.720 --> 39:58.760]  I can't remember exactly what it is.
[40:00.700 --> 40:02.860]  If I had 10 hours to cut down a tree
[40:02.860 --> 40:05.600]  I would spend the first 8 sharpening my ax.
[40:06.160 --> 40:07.980]  That sounds right, I could have made that up.
[40:08.960 --> 40:11.640]  The same is true for machine learning
[40:11.640 --> 40:13.260]  and I would say data science.
[40:14.120 --> 40:15.160]  I don't know if
[40:17.440 --> 40:18.760]  Comath, do you want to chime in there
[40:18.760 --> 40:20.780]  as a data scientist?
[40:23.100 --> 40:24.760]  It's true or not?
[40:29.500 --> 40:34.180]  I tend to try something that works
[40:34.180 --> 40:35.740]  and then go back and perfect it
[40:35.740 --> 40:37.180]  to make sure I don't get little mistakes
[40:37.180 --> 40:39.580]  because when you scale things up
[40:39.580 --> 40:42.380]  it really screws you.
[40:42.380 --> 40:45.220]  There's a bug 0.1% of the time
[40:45.220 --> 40:47.260]  and now you have 100,000 things
[40:47.260 --> 40:49.780]  the bug is guaranteed to happen.
[40:51.020 --> 40:52.360]  It's better to clean those up
[40:52.360 --> 40:55.520]  before you really...
[40:55.520 --> 40:59.920]  If you're a data scientist
[41:00.760 --> 41:02.060]  in the twitch chat
[41:02.060 --> 41:04.660]  what do you think the percentage of cleaning
[41:04.660 --> 41:06.280]  versus actual machine learning
[41:06.280 --> 41:09.540]  data processing versus machine learning
[41:09.540 --> 41:10.960]  I think that would be interesting
[41:12.260 --> 41:14.040]  and if you're new to machine learning
[41:14.040 --> 41:17.540]  I would say it's okay
[41:17.540 --> 41:20.740]  to be impatient and go really fast
[41:20.740 --> 41:22.040]  because you want to get the end result
[41:22.040 --> 41:25.520]  but if you want to get quality results
[41:25.520 --> 41:28.320]  it is better to go a little slower.
[41:30.720 --> 41:32.240]  We're just going to create a new list
[41:32.240 --> 41:35.320]  and the difference here is we're using a findall
[41:35.320 --> 41:37.540]  so we're going to start removing some punctuation
[41:37.540 --> 41:40.280]  and numbers and then we're just going to add it
[41:40.280 --> 41:42.360]  to a new list and then we're going to print
[41:42.360 --> 41:45.580]  the top 100 tokens.
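A sketch of that filtering step using `re.findall` and `collections.Counter`. The letters-only pattern and the sample script are assumptions; the workshop's actual regex may keep or drop different characters.

```python
import re
from collections import Counter

# Hypothetical sample script, not taken from the data set.
script = (
    "$addr = GetProcAddress($handle, 'VirtualAlloc'); $size = 4096; "
    "GetProcAddress($handle, 'WriteProcessMemory')"
)

# Keep runs of letters only, so digits and punctuation fall away.
tokens = re.findall(r"[A-Za-z]+", script)

# Count the surviving tokens and keep the 100 most common.
top_tokens = Counter(tokens).most_common(100)
```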
[41:48.480 --> 41:50.020]  So all the numbers are gone
[41:50.320 --> 41:51.600]  a lot of the punctuation is gone
[41:52.120 --> 41:53.800]  I'd say that looks quite a bit better
[41:53.800 --> 41:56.300]  than the other one.
[41:57.080 --> 41:59.560]  We still have GetProcAddress
[42:01.760 --> 42:04.720]  we're still seeing the malicious tokens
[42:04.720 --> 42:05.620]  that we would expect
[42:05.620 --> 42:06.480]  shellcode
[42:07.420 --> 42:10.580]  so we haven't completely ruined our
[42:12.320 --> 42:14.900]  I want to call them crispy bits
[42:14.900 --> 42:17.800]  but we haven't ruined our
[42:17.800 --> 42:20.360]  what we want to model effectively
[42:20.360 --> 42:23.500]  so that's also there.
[42:26.400 --> 42:26.740]  So this bit
[42:26.740 --> 42:29.920]  I just said pick some tokens and go google them
[42:29.920 --> 42:32.320]  and make sure that they're
[42:33.460 --> 42:35.140]  see what comes up
[42:35.740 --> 42:36.960]  that's kind of the fun bit
[42:38.980 --> 42:41.740]  I suppose if we are live I can do that
[42:42.460 --> 42:43.340]  let's see
[42:47.760 --> 42:49.740]  that's what pehandle does
[42:53.920 --> 42:55.160]  I'm using edge
[42:55.160 --> 42:56.540]  that's embarrassing
[43:04.400 --> 43:06.700]  Have you guys heard of Mimikatz?
[43:08.420 --> 43:10.240]  Do you know if that one's malicious?
[43:14.600 --> 43:16.000]  Yeah, I would say so.
[43:17.100 --> 43:18.260]  PowerShell Type Loader?
[43:18.260 --> 43:19.680]  Mimikatz is flagged as malicious, yeah.
[43:19.740 --> 43:21.760]  Yeah, yeah, definitely malicious.
[43:24.780 --> 43:32.500]  And I would say, as an operator, I think a lot of the industry uses a lot of the same scripts.
[43:32.500 --> 43:39.910]  So, yeah.
[43:40.910 --> 43:45.890]  In fact, the DLL injection, bookdllinjection, atgraver...
[43:46.730 --> 43:50.710]  Yeah, so these are malicious type scripts.
[43:52.330 --> 44:01.140]  Okay, so once we have that bit, you know, you could...
[44:01.140 --> 44:02.780]  We could stop here, right?
[44:02.780 --> 44:06.900]  So we have some malicious tokens that we know that, you know,
[44:06.900 --> 44:13.200]  these are all tokens or words that were labeled by Defender as malicious.
[44:13.260 --> 44:19.980]  So we could stop here, theoretically, and write malicious scripts that, you know,
[44:19.980 --> 44:21.900]  don't use any of these words.
[44:22.100 --> 44:26.040]  Alternatively, you could do the same with the clean scripts
[44:26.040 --> 44:30.560]  and only use, you know, words from those clean scripts.
[44:30.560 --> 44:32.280]  Or, you know, you could do both.
[44:32.280 --> 44:36.200]  But I think that would probably work.
[44:37.540 --> 44:42.840]  But I think you would have a hard time making it repeatable.
[44:42.920 --> 44:45.410]  And you wouldn't necessarily, you know, if something...
[44:45.920 --> 44:51.620]  If it all of a sudden stopped working, I don't think you would have much recourse
[44:51.620 --> 44:54.200]  in being able to figure out how or why.
[44:55.320 --> 44:59.980]  And so I think this is kind of where machine learning shines for us.
[44:59.980 --> 45:07.440]  And it's just the ability to iterate through massive amounts of data extremely quickly.
[45:07.940 --> 45:14.880]  So, you know, while... I'm going to use the GPT-2 phishing analogy.
[45:14.880 --> 45:25.000]  And it's like, you can spend an hour crafting, you know, one phish for five targets.
[45:25.000 --> 45:34.260]  Or you could spend an hour correcting five unique phish generated out of GPT-2 for five targets.
[45:35.580 --> 45:42.320]  GPT-2 language models aren't perfect, but they do help scale.
[45:43.200 --> 45:46.580]  You know, so it's like you generate five of them and you can correct them,
[45:46.580 --> 45:52.160]  make them seem, you know, realistic and not sound terrible.
[45:52.840 --> 45:55.740]  In the same time it would take you to do one.
[45:55.740 --> 46:01.620]  So, you know, in terms of pulling out insights from Windows Defender, I think this is also true.
[46:02.200 --> 46:06.580]  All right. So we're getting to... we got to the machine learning bit.
[46:07.360 --> 46:12.460]  And in this manual, you know, we referenced 380,000 scripts.
[46:12.460 --> 46:18.760]  There are, I think, 410 in the whole piece.
[46:18.760 --> 46:24.080]  So in that PowerShell link at the top, you know, it has that many.
[46:24.740 --> 46:28.620]  We just use... I just chose the biggest folder that had them in there.
[46:30.560 --> 46:33.780]  And then introduced them to Insight.
[46:35.960 --> 46:39.060]  So we're going to get into some of the data representation.
[46:39.060 --> 46:44.300]  I think data representation is probably my favorite part of machine learning.
[46:44.300 --> 46:54.260]  It's really where you get to shape... kind of shape the output and shape...
[46:54.260 --> 46:56.840]  not the model, because the model has its own architecture,
[46:56.840 --> 47:04.060]  but you get to encode and embed your domain knowledge when you...
[47:04.060 --> 47:09.620]  you know, I guess it would be called... it's fine if I'm wrong,
[47:09.620 --> 47:12.240]  but you could call it feature engineering.
[47:13.820 --> 47:17.720]  And that piece is extremely important.
[47:17.720 --> 47:23.580]  So, you know, if you want to...
[47:23.580 --> 47:32.140]  the outputs of your model are... the accuracy of your model correlates directly
[47:32.140 --> 47:37.260]  with the quality of your data representation and your feature engineering.
[47:38.320 --> 47:40.400]  And it's... yeah, it's my favorite part.
[47:42.020 --> 47:44.880]  Okay, so...
[47:46.940 --> 47:52.320]  GT went through tokenization earlier today,
[47:52.740 --> 47:55.140]  but we've already kind of played with tokenization,
[47:55.140 --> 48:00.140]  but tokenization is effectively the process of splitting,
[48:00.140 --> 48:05.240]  you know, the words into separate words.
[48:05.240 --> 48:08.440]  So, you know, each...
[48:08.440 --> 48:11.420]  I'm losing it.
[48:11.800 --> 48:13.240]  Not into separate words.
[48:13.240 --> 48:16.560]  You piece out sentences into a list.
[48:16.560 --> 48:20.360]  So once we have a tokenizer, rather than...
[48:20.360 --> 48:24.460]  we could write it ourselves, like we did up here,
[48:24.460 --> 48:28.540]  but I would imagine there are probably better developers out there,
[48:28.540 --> 48:34.860]  and I trust them probably more than I trust my own code for some reason.
[48:35.180 --> 48:36.720]  But if you imagine it as a corpus,
[48:36.720 --> 48:41.040]  so if you imagine this corpus is actually just our big PowerShell list,
[48:41.040 --> 48:44.320]  so each line is a PowerShell script,
[48:44.320 --> 48:46.320]  you know, just a list of it,
[48:46.320 --> 48:48.240]  and we're going to call fit_on_texts,
[48:48.240 --> 48:50.820]  and this is just to create an index.
[48:51.180 --> 48:53.960]  So we're just creating those data structures,
[48:53.960 --> 49:01.120]  and you can see, you know, each word gets its own integer.
[49:01.860 --> 49:05.140]  And, you know, in terms of sequential numbers,
[49:05.140 --> 49:08.140]  like in a text prediction scenario,
[49:08.140 --> 49:09.960]  like sequences are really nice,
[49:10.560 --> 49:14.520]  because you can say, you know, what number comes after what,
[49:14.520 --> 49:17.680]  and you can go back to your index and look it up.
[49:23.220 --> 49:26.980]  Yeah, so if we can go, my cat likes mittens.
[49:29.280 --> 49:32.820]  I don't have a cat, but I assume cats like mittens.
[49:34.380 --> 49:36.980]  Or most cats are called mittens.
[49:38.740 --> 49:42.520]  But we're just taking the first index and we're tokenizing it.
[49:49.920 --> 49:52.080]  Yeah, six small cats.
[49:52.540 --> 49:54.300]  Yeah, it's a word index.
[49:54.460 --> 49:55.900]  It's fairly straightforward.
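To make the word index concrete, here is a standalone sketch that imitates the idea in plain Python. Keras's `Tokenizer` builds a comparable index when you call `fit_on_texts`, but this is not that implementation: the third sentence in the corpus is made up, and breaking ties alphabetically is a choice of this sketch.

```python
from collections import Counter

corpus = [
    "my cat likes mittens",
    "six small cats",
    "my cat is called mittens",   # hypothetical extra sentence
]

# Count every word, then give the most frequent words the lowest integers.
counts = Counter(word for doc in corpus for word in doc.lower().split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_sequences(docs):
    """Replace each known word with its integer from the index."""
    return [[word_index[w] for w in doc.lower().split() if w in word_index]
            for doc in docs]


first_sequence = texts_to_sequences(corpus[:1])[0]
```

Going back from a number to a word is just a reverse lookup in the same index.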
[49:57.460 --> 50:01.460]  And there are a number of ways to represent text.
[50:01.580 --> 50:03.700]  You have machine learning models that'll do it,
[50:03.700 --> 50:09.400]  you have frequency models,
[50:09.400 --> 50:12.520]  you have TF-IDF,
[50:12.520 --> 50:18.460]  which is Term Frequency Inverse Document Frequency,
[50:18.460 --> 50:21.460]  which is like a weighted frequency.
[50:21.800 --> 50:23.200]  What else do you have?
[50:24.260 --> 50:25.740]  What am I missing?
[50:28.260 --> 50:31.600]  Obviously, one-hot encoding, that's where we're going to go next.
[50:33.340 --> 50:39.120]  So one-hot encoding is, basically, it's a vector,
[50:39.120 --> 50:42.240]  which is the length of your vocabulary.
[50:43.200 --> 50:49.640]  And the presence of a token or a word is denoted by a 1.
[50:49.640 --> 50:52.660]  And if it doesn't exist in a document, then it's a 0.
[50:53.200 --> 50:55.240]  And so what you end up getting is...
[50:55.240 --> 50:58.280]  Let's see how long these are...
[50:58.280 --> 50:59.400]  Well, it's only 17.
[50:59.400 --> 51:06.960]  So each vector is going to be 17 integers long.
[51:06.960 --> 51:10.660]  And the array is going to look like this.
[51:10.660 --> 51:13.580]  So this is just the first sentence.
[51:13.580 --> 51:19.960]  But as we go through, you'll see as the words change,
[51:20.520 --> 51:23.480]  you get different... what do you call it?
[51:23.560 --> 51:25.920]  Not indications, activations?
[51:25.920 --> 51:28.460]  That's probably incorrect.
[51:28.960 --> 51:32.020]  You get different indications.
[51:34.880 --> 51:38.260]  Hopefully that makes a little bit of sense.
[51:39.520 --> 51:41.380]  And let's see...
[51:41.380 --> 51:50.960]  The nice part about one-hot encoding is that you do get every token in.
[51:51.880 --> 51:57.420]  Once they're well represented, you do get some sparsity.
[51:57.420 --> 52:06.180]  So if you have a really big corpus or vocabulary, you get some sparsity in there.
[52:08.040 --> 52:12.060]  But it's really nice if you only care about the presence of a word.
[52:12.060 --> 52:20.160]  It doesn't necessarily keep the semantics or the syntax.
[52:23.060 --> 52:26.500]  It's not very good for text prediction.
[52:26.500 --> 52:29.760]  It is good for classification.
[52:30.640 --> 52:32.100]  Which is kind of what we want to do.
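A small sketch of the presence-style encoding described above, in plain Python. The seven-word vocabulary is a made-up stand-in (the notebook's vocabulary has 17 entries), and `encode_presence` is a hypothetical helper name.

```python
vocab = ["cat", "cats", "likes", "mittens", "my", "six", "small"]
position = {word: i for i, word in enumerate(vocab)}


def encode_presence(doc):
    """Return a vector as long as the vocabulary: 1 if the word occurs, else 0."""
    vector = [0] * len(vocab)
    for word in doc.lower().split():
        if word in position:
            vector[position[word]] = 1
    return vector


first = encode_presence("my cat likes mittens")
second = encode_presence("six small cats")
```

Word order and repeat counts are lost, which is why this suits classification better than text prediction.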
[52:33.120 --> 52:38.580]  Alright, so now that we have an idea of what tokenization does and the tokenization scheme that we're going to use,
[52:38.580 --> 52:44.240]  we're going to get into my favorite data structure, which is named tuples.
[52:44.980 --> 52:52.480]  Named tuples are immutable... well, they're tuples, but they're like immutable classes.
[52:52.480 --> 52:57.940]  So you can kind of create them, you can set your variables to whatever you want,
[52:57.940 --> 53:02.440]  and it's really nice because then you can turn around and reference a tuple,
[53:02.440 --> 53:05.900]  or reference a variable inside of a tuple, by name,
[53:05.900 --> 53:15.780]  which makes your code read a lot cleaner than if you were just going to use a tuple or a list or even a dictionary.
[53:15.780 --> 53:20.660]  Dictionaries can be... actually, I don't know the speed difference between dictionaries and named tuples.
[53:20.660 --> 53:22.120]  Does anybody know?
[53:23.780 --> 53:24.840]  If there's any?
[53:26.660 --> 53:27.740]  Maybe not.
[53:28.300 --> 53:30.140]  But this is an example of a named tuple.
[53:31.860 --> 53:36.480]  It's like when you find something you like and you just use them absolutely everywhere.
[53:36.780 --> 53:38.660]  This is where I'm at with named tuples.
[53:40.800 --> 53:42.460]  Just because they're super nice.
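For anyone who has not used them, a minimal named tuple example with `collections.namedtuple`. The `Sample` type and its field names are hypothetical, chosen only to match the text-plus-label idea in this workshop.

```python
from collections import namedtuple

# Field names are hypothetical; the workshop's own fields may differ.
Sample = namedtuple("Sample", ["text", "label"])

sample = Sample(text="New-Object Net.WebClient", label=1)
label_by_name = sample.label   # readable access by name
label_by_index = sample[1]     # still behaves like a plain tuple

try:
    sample.label = 0           # immutable: assignment raises AttributeError
    mutated = True
except AttributeError:
    mutated = False
```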
[53:44.890 --> 53:45.970]  Alright.
[53:46.390 --> 53:52.250]  So this code, we are going to kind of create our training structure.
[53:52.350 --> 53:57.430]  So this is... you should be familiar with everything in this code block by now.
[53:57.430 --> 54:03.370]  So we are... we have our filter.
[54:03.550 --> 54:06.430]  The tokenizer has a filter attached to it.
[54:07.390 --> 54:12.870]  If you want to use that one, we're going to create a primary list to hold all of our training data,
[54:12.870 --> 54:17.990]  and then we're going to hold a... we're going to create a list that we're going to use for tokenizing.
[54:18.630 --> 54:22.850]  And this is kind of... eventually, when you start building these small pieces,
[54:22.850 --> 54:25.970]  eventually there's a point where they kind of all come together,
[54:25.970 --> 54:29.830]  and you can transition from processing into learning.
[54:30.110 --> 54:33.290]  And this script here is where we do that.
[54:33.650 --> 54:38.850]  So, all of a sudden we can... I shouldn't say script, but...
[54:38.850 --> 54:44.370]  So we've built all these tiny pieces out, and you can kind of see how they're coming together.
[54:44.710 --> 54:50.510]  And there's two really important things happening in this little code block.
[54:51.350 --> 55:02.750]  Obviously the tokenization is happening, and we are collecting all of the... all of the named tuples.
[55:02.750 --> 55:03.770]  So I know named tuples.
[55:03.770 --> 55:07.050]  So we're going to go through every malicious script.
[55:07.050 --> 55:09.570]  We're going to decode it as we did before.
[55:09.570 --> 55:11.610]  We're going to remove all the punctuation.
[55:11.610 --> 55:12.770]  We're going to split it.
[55:12.770 --> 55:14.170]  We're going to unique it.
[55:14.790 --> 55:20.050]  I did this because I assume it makes tokenization faster, but I don't...
[55:21.010 --> 55:22.750]  As I said, I don't have any evidence for that.
[55:22.750 --> 55:24.530]  It just seemed to make sense.
[55:25.990 --> 55:27.870]  Maybe you guys can try and let me know.
[55:29.350 --> 55:35.490]  We are going to then rejoin the unique tokens, unique words in a sentence,
[55:35.490 --> 55:43.350]  and we're just going to add them to a named tuple, and then add that named tuple to a list.
[55:43.350 --> 55:47.390]  So the important piece here is the label.
[55:50.410 --> 55:58.410]  For malware, the label is a 1, and then for clean, labels are a 0.
[55:58.410 --> 56:02.110]  So this is where we're kind of creating our training structure.
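The steps just listed (decode, strip punctuation, split, unique, rejoin, label) might be sketched like this. The `Sample` fields, the `prepare` helper, the regex, and the two sample scripts are all assumptions for illustration; sorting the unique words is only there to make the output deterministic.

```python
import re
from collections import namedtuple

Sample = namedtuple("Sample", ["text", "label"])  # hypothetical field names


def prepare(raw_bytes, label):
    text = raw_bytes.decode("utf-8", errors="ignore")
    words = re.findall(r"[A-Za-z]+", text)     # drop punctuation and digits
    unique_words = sorted(set(words))          # "unique it"
    return Sample(text=" ".join(unique_words), label=label)


malicious = [b"$a = GetProcAddress($h, 'x'); GetProcAddress($h, 'y')"]
clean = [b"Get-ChildItem | Sort-Object Name"]

# Label 1 for malicious, 0 for clean, as described above.
all_samples = [prepare(s, 1) for s in malicious] + [prepare(s, 0) for s in clean]
docs = [s.text for s in all_samples]           # the list handed to the tokenizer
```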
[56:02.930 --> 56:13.110]  The other piece you might notice is originally we had 380,000 scripts that we ran this on. For this demonstration,
[56:13.110 --> 56:16.770]  this workshop, GitHub didn't really like that.
[56:17.370 --> 56:23.610]  So we only gave you 3,000, but that's still like a 2-to-1.
[56:23.610 --> 56:31.630]  But when you're having 380,000 scripts versus almost 1,200 scripts, your data is very unbalanced.
[56:31.630 --> 56:40.070]  So if you, for example, were to do a similar scheme as we did earlier with just the malicious scripts, but did it for both,
[56:40.070 --> 56:47.630]  what might end up happening is your clean script words will start to...
[56:49.510 --> 56:51.770]  The word isn't drown out.
[56:54.430 --> 56:56.290]  What's the word, Eric?
[56:56.390 --> 56:57.190]  Spin.
[56:57.190 --> 56:58.990]  What's the technical term?
[57:04.400 --> 57:05.980]  I'm going to say drown out.
[57:08.220 --> 57:10.820]  Or average out, whatever it might be.
[57:10.820 --> 57:15.720]  It's like if your GPA is really bad, eventually you just can't get it up.
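To put numbers on the imbalance, a quick sketch using the two figures quoted above. Random downsampling to roughly 2-to-1 is shown as one possible remedy; it is an assumption of this sketch, not necessarily how the workshop data was prepared.

```python
import random

n_large = 380_000   # size of the bigger class, as quoted in the talk
n_small = 1_200     # size of the smaller class

# A model that always predicts the bigger class already scores this accuracy.
majority_baseline = n_large / (n_large + n_small)

# Randomly keep only twice as many items from the bigger class.
random.seed(0)
kept = random.sample(range(n_large), 2 * n_small)
```

That baseline is why raw accuracy on an unbalanced set can look far better than the model really is.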
[57:18.420 --> 57:19.120]  Anyway.
[57:19.200 --> 57:20.940]  Alright, so then at the end of this...
[57:20.940 --> 57:30.220]  The outcome of this is we're going to get a list that has all of our named tuples that are ready to be fed into a model,
[57:30.220 --> 57:33.840]  and then we're going to have all of our documents that are ready to be tokenized.
[57:33.840 --> 57:37.780]  So at the end of this, we're just going to build the vocab, and we're going to have everything.
[57:38.780 --> 57:40.420]  This is a little blurb.
[57:43.060 --> 57:46.000]  So it takes a little bit to go through all of them.
[57:57.480 --> 57:58.940]  It's because I opened this.
[58:03.050 --> 58:04.570]  I don't want to tokenize that.
[58:06.130 --> 58:15.210]  Also, you guys probably know this, but if you guys use Vim, it creates a swap file.
[58:15.210 --> 58:30.430]  So I helped debug some issues where they were opening their data files inside of a folder and running Python out of the same folder.
[58:30.430 --> 58:36.030]  And so their Python was trying to tokenize their text, and they couldn't figure out the error.
[58:36.030 --> 58:39.070]  And it was just because they were trying to read a swap file.
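One way to guard against that swap-file problem when reading a data folder. The `data_files` helper is hypothetical, and skipping hidden files plus `.swp`/`.swo` suffixes is an assumption about what should count as noise.

```python
import tempfile
from pathlib import Path


def data_files(folder):
    """Yield regular files, skipping hidden files and editor swap files."""
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        if path.name.startswith(".") or path.suffix in {".swp", ".swo"}:
            continue
        yield path


# Demonstrate with a throwaway folder holding one script and one swap file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "script1.ps1").write_text("Get-Process")
    Path(tmp, ".script1.ps1.swp").write_bytes(b"\x00binary swap data")
    found = [p.name for p in data_files(tmp)]
```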
[58:41.370 --> 58:43.950]  So now we have our vocab.
[58:50.570 --> 58:54.050]  And we can kind of see the tokens that came out.
[58:54.050 --> 58:58.230]  These are the tokens after our filtering.
[58:58.730 --> 59:10.810]  The closer you get to training, you want to be increasingly happy with what you're seeing in terms of the words or the tokens.
[59:11.230 --> 59:15.090]  And I think that I'm more happy with this than when we started.
[59:15.710 --> 59:16.970]  But definitely play by ear.
[59:16.970 --> 59:20.930]  But it is useful to take a moment and see.
[59:21.010 --> 59:22.050]  Just take a look.
[59:24.290 --> 59:26.110]  Remote DLL handle.
[59:26.110 --> 59:26.810]  Just take a look.
[59:28.050 --> 59:29.550]  There's quite a lot of them.
[59:31.110 --> 59:37.730]  So this is something I haven't dealt with, and you can see in the insights.xls.
[59:37.730 --> 59:41.910]  But if you were going to deal with it, this is probably how you would do it.
[59:42.570 --> 59:47.590]  So you just want to do a regex for this TVqQAAMA.
[59:47.590 --> 59:49.770]  Does anybody know what that is?
[59:50.450 --> 59:51.590]  I bet Rob does.
[01:00:05.220 --> 01:00:07.340]  I see you're binging it.
[01:00:07.340 --> 01:00:10.740]  Yeah, this is embarrassing.
[01:00:12.480 --> 01:00:14.360]  Bing, the best way to Google.
[01:00:16.360 --> 01:00:20.620]  But yeah, you'd want to remove those obviously.
[01:00:21.560 --> 01:00:23.100]  I didn't.
[01:00:23.820 --> 01:00:27.100]  It indicates that it is a PE file.
[01:00:31.770 --> 01:00:37.510]  Compressed or embedded in some other medium.
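A short check of where that string comes from, plus one possible way to strip it. `TVqQAAMA` is what you get when the first bytes of a typical PE file (the `MZ` header) are Base64-encoded. The removal pattern and the sample script are assumptions for illustration.

```python
import base64
import re

# "MZ" plus the next header bytes of a typical PE file, Base64-encoded.
prefix = base64.b64encode(b"MZ\x90\x00\x03\x00").decode("ascii")

# Assumed pattern: the prefix followed by a run of Base64 characters.
pe_blob = re.compile(r"TVqQAAMA[A-Za-z0-9+/=]*")

# Hypothetical script with an embedded, Base64-encoded PE header.
script = (
    "$bytes = [Convert]::FromBase64String('TVqQAAMAAAAEAAAA//8AALgAAAA='); "
    "Invoke-Thing $bytes"
)
cleaned = pe_blob.sub("", script)
```

Whether to remove such blobs or replace them with a single marker token is a modelling choice; the marker keeps the signal that a PE was embedded.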
[01:00:38.190 --> 01:00:40.790]  Okay, so now we're at the point where we are...
[01:00:41.310 --> 01:00:43.770]  Well, I already described this, but...
[01:00:43.770 --> 01:00:47.930]  If you're not familiar with machine learning at all, whatsoever,
[01:00:49.370 --> 01:00:51.750]  there's a little video series by...
[01:00:53.010 --> 01:00:54.510]  What's his name? I don't know.
[01:00:54.510 --> 01:00:57.270]  But his YouTube channel is 3Blue1Brown.
[01:00:57.270 --> 01:00:59.470]  It's just like four videos, about an hour.
[01:00:59.530 --> 01:01:01.150]  It is...
[01:01:02.490 --> 01:01:03.690]  It's really good.
[01:01:03.690 --> 01:01:08.790]  It's going to be way better than I will ever be able to explain machine learning to you.
[01:01:09.430 --> 01:01:12.950]  So I recommend having a look through that.
[01:01:13.670 --> 01:01:14.630]  Watch it a couple times.
[01:01:14.630 --> 01:01:16.850]  And if you're really into math, he has a lot of cool stuff.
[01:01:18.350 --> 01:01:21.290]  Yeah, his animations are really good.
[01:01:23.290 --> 01:01:26.770]  Neural networks, we've kind of all seen this picture.
[01:01:27.550 --> 01:01:30.150]  But our input layers represent our input data.
[01:01:30.150 --> 01:01:32.610]  Or our one-hot encoded text.
[01:01:32.630 --> 01:01:36.070]  Hidden layers represent an activation function.
[01:01:36.610 --> 01:01:41.170]  And the output layer or node is the result of the network.
[01:01:41.170 --> 01:01:43.630]  Or classification, prediction, whatever it might be.
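A toy forward pass through that picture, in plain Python. The weights, biases and layer sizes are made-up numbers; a real network would learn them and would normally use a framework rather than hand-written loops.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


x = [1, 0, 1]                                             # one-hot style input
hidden = dense(x, [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])  # hidden layer
output = dense(hidden, [[1.2, -0.7]], [0.05])[0]          # single output node
```

The output lands between 0 and 1, which is what lets it be read as a score for the "malicious" class.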
[01:01:44.910 --> 01:01:47.290]  There are a lot of...
[01:01:47.290 --> 01:01:52.670]  Machine learning is more than this little picture that we see everywhere.
[01:01:54.130 --> 01:02:02.590]  And I would say this is like the tiniest little bit of machine learning.
[01:02:02.590 --> 01:02:10.530]  And even inside of this picture, the mechanics that are going on are quite extensive.
[01:02:11.490 --> 01:02:16.590]  And this is just like the most vanilla machine learning implementation.
[01:02:16.590 --> 01:02:30.550]  I think the nice part about Keras and the frameworks is they bring that kind of ability and power to non-mathematicians.
[01:02:32.090 --> 01:02:35.510]  What do you call non-mathematicians, Sven?
[01:02:37.050 --> 01:02:38.530]  I don't.
[01:02:38.530 --> 01:02:40.730]  Is there an industry word for them?
[01:02:44.710 --> 01:02:45.590]  Students?
[01:02:46.550 --> 01:02:46.990]  What?
[01:02:47.470 --> 01:02:48.350]  Students?
[01:02:49.450 --> 01:02:50.490]  Yeah, something like that.
[01:02:50.550 --> 01:02:51.170]  Normies.
[01:02:51.370 --> 01:02:52.750]  Normies, yeah.
[01:02:53.270 --> 01:02:57.010]  Honestly, we just don't think about non...
[01:02:57.010 --> 01:02:58.010]  Yeah.
[01:02:58.270 --> 01:03:00.450]  I don't want to dig myself into a hole here.
[01:03:00.530 --> 01:03:02.710]  You've got too much math to think about.
[01:03:02.790 --> 01:03:06.110]  As a non-mathematician, I take great offense to this.
[01:03:07.010 --> 01:03:08.290]  Yeah, yeah.
[01:03:08.290 --> 01:03:09.810]  Laypeople is right, yeah.
[01:03:09.810 --> 01:03:11.730]  Rich has got it right. Laypeople.
[01:03:11.990 --> 01:03:13.270]  Yeah, laypeople.
[01:03:13.510 --> 01:03:20.830]  And if you are... I think the first book I read was Make Your Own Neural Network by Tariq Rashid.
[01:03:22.370 --> 01:03:25.790]  And that was... it was just the most... it was like 120 pages.
[01:03:25.790 --> 01:03:33.830]  It was just the most basic, most straightforward explanation of a neural network.
[01:03:33.830 --> 01:03:48.750]  I think there's a bajillion tutorials out there, and I think they are okay, but they're always so quick to get into a framework without actually explaining what's going on underneath.
[01:03:50.010 --> 01:03:52.090]  And that is definitely a hindrance.
[01:03:52.090 --> 01:03:57.890]  You don't have to be a mathematician to use machine learning, but you should at least spend time learning the basics.
[01:04:00.890 --> 01:04:01.770]  Okay.
[01:04:03.010 --> 01:04:03.770]  So.
[01:04:04.830 --> 01:04:05.830]  Let's see.
[01:04:05.830 --> 01:04:07.570]  We have a little explanation of machine learning.
[01:04:07.570 --> 01:04:09.290]  We have our training set.
[01:04:09.290 --> 01:04:11.250]  We're happy with the tokenization.
[01:04:12.370 --> 01:04:13.690]  To a point.
[01:04:13.870 --> 01:04:15.830]  Happy enough for now, anyway.
[01:04:16.270 --> 01:04:18.370]  You know, eventually...
[01:04:18.370 --> 01:04:22.690]  I think you always want to be moving forward, so you always kind of want to be thinking in pipelines.
[01:04:22.690 --> 01:04:37.650]  So I wouldn't spend too much time necessarily in one area, but when you build it, make it such that it is modular and can be put into a pipeline.
[01:04:37.970 --> 01:04:38.650]  Let's see.
[01:04:38.650 --> 01:04:43.350]  So now we need to tokenize all the documents.
[01:04:43.350 --> 01:04:59.130]  And this will take a list, which I think docs is.
[01:05:01.230 --> 01:05:02.730]  We'll find out.
[01:05:07.320 --> 01:05:08.040]  Nice.
[01:05:08.040 --> 01:05:10.200]  And then we can... let's just double check.
[01:05:16.390 --> 01:05:17.630]  Yeah, that's good.
[01:05:21.470 --> 01:05:22.430]  Alright.
[01:05:22.530 --> 01:05:26.290]  And now we're going to create our score array.
[01:05:26.290 --> 01:05:28.270]  So here we have...
[01:05:28.270 --> 01:05:30.630]  So this is a named tuple that we're going to go through.
[01:05:30.630 --> 01:05:36.750]  So we're referencing E, which is a terrible variable name, sorry.
[01:05:36.890 --> 01:05:40.870]  But alt text is the list that's holding all of our named tuples.
[01:05:41.430 --> 01:05:45.090]  And this is called list comprehension.
[01:05:47.710 --> 01:05:49.850]  Another favorite of mine.
[01:05:50.130 --> 01:05:52.090]  So it creates a list.
[01:05:52.210 --> 01:05:56.710]  Put an expression and a for loop inside of square brackets and it will create you a list.
[01:05:56.710 --> 01:05:58.950]  But you can see we're referencing the labels.
[01:05:58.950 --> 01:06:00.830]  So we're just going to create a score matrix.
[01:06:00.830 --> 01:06:03.130]  And actually you can see what that looks like.
[01:06:06.010 --> 01:06:07.910]  But they're just giant arrays.
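The label-gathering step could be sketched as a list comprehension like this. The `Sample` type, its field names and the three sample entries are hypothetical; `e` is kept as the loop variable only to mirror the name mentioned above.

```python
from collections import namedtuple

Sample = namedtuple("Sample", ["text", "label"])  # hypothetical field names

all_samples = [
    Sample("GetProcAddress shellcode", 1),
    Sample("Get ChildItem", 0),
    Sample("Mimikatz", 1),
]

# One label per sample, in the same order as the samples themselves.
labels = [e.label for e in all_samples]
```

Keeping the labels in the same order as the encoded documents matters, because row i of the inputs is scored against label i.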
[01:06:11.830 --> 01:06:24.630]  And the reason they're giant arrays is because the mechanics of machine learning are kind of...
[01:06:24.630 --> 01:06:27.670]  Would you say rooted in matrix multiplication?
[01:06:30.170 --> 01:06:35.830]  No, because that ignores all of the decision trees and a bunch of other things.
[01:06:35.830 --> 01:06:36.490]  Yeah.
[01:06:41.030 --> 01:06:47.270]  So when I was talking about this picture being kind of dumb, it's the only picture we ever see.
[01:06:47.350 --> 01:06:53.050]  And it's just one of the smallest pieces.
[01:06:53.330 --> 01:06:57.270]  There's so much more out there that you should look at.
[01:06:57.270 --> 01:07:00.430]  And I probably could have introduced you to.
[01:07:01.050 --> 01:07:03.170]  So now we have our score matrix.
[01:07:04.410 --> 01:07:06.570]  Actually no, I'm looking at this.
[01:07:17.230 --> 01:07:18.070]  Anyway.
[01:07:20.310 --> 01:07:22.670]  So these are all our labels.
[01:07:23.030 --> 01:07:31.310]  So when the network is learning, it's going to calculate a loss.
[01:07:31.310 --> 01:07:32.710]  So it's going to have an input.
[01:07:33.410 --> 01:07:40.150]  And then at the end of that it's going to take whatever the label was and see how close it was to the label.
[01:07:40.150 --> 01:07:43.970]  So a large loss is bad and a small loss is good.
[01:07:44.750 --> 01:08:00.950]  And through gradient descent and backpropagation, the network will update the weights such that the next time it sees something labeled like that, it'll hopefully be closer.
[01:08:01.290 --> 01:08:03.030]  And we'll get into that.
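To make the loss-and-update loop concrete, here is a small sketch using a single sigmoid neuron and binary cross-entropy. The input vector, learning rate, and loss function are assumptions chosen for illustration; the workshop model may use a different loss and optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(pred, label):
    # Binary cross-entropy: small when pred is close to label, large otherwise.
    eps = 1e-12
    return -(label * math.log(pred + eps) + (1 - label) * math.log(1 - pred + eps))

x = [1.0, 0.0, 1.0]        # one input vector (hypothetical token flags)
label = 1.0                # the label the network is compared against
weights = [0.0, 0.0, 0.0]
learning_rate = 0.5

losses = []
for step in range(5):
    z = sum(w * xi for w, xi in zip(weights, x))
    pred = sigmoid(z)
    losses.append(bce_loss(pred, label))
    # For a sigmoid output with cross-entropy, d(loss)/d(w_i) = (pred - label) * x_i.
    weights = [w - learning_rate * (pred - label) * xi for w, xi in zip(weights, x)]

print([round(l, 3) for l in losses])
```

Each pass nudges the weights against the gradient, so the recorded loss shrinks from step to step on this one example.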
[01:08:06.250 --> 01:08:09.170]  I like this. This is probably one of my favorites.
[01:08:15.100 --> 01:08:21.700]  So tabs vs. spaces, nano vs. VS Code, Messi vs. that other guy.
[01:08:21.940 --> 01:08:24.360]  Machine learning frameworks are no different.
[01:08:26.020 --> 01:08:29.000]  So do you guys have preferences for machine learning frameworks?
[01:08:29.840 --> 01:08:36.800]  Why do you choose the machine learning framework you do?
[01:08:39.950 --> 01:08:46.330]  I use PyTorch because it tends to look more like real code than TensorFlow.
[01:08:46.490 --> 01:08:49.870]  But Keras is super easy to get up and running.
[01:08:50.090 --> 01:08:51.430]  So it's a good choice.
[01:08:54.130 --> 01:09:00.190]  And then when you do the real web stuff, I reach into a compiled language like Rust or C.
[01:09:00.970 --> 01:09:02.530]  Oh, C, that's brave.
[01:09:04.470 --> 01:09:06.610]  Has anybody used ML.NET?
[01:09:09.170 --> 01:09:11.510]  Did anybody know ML.NET existed?
[01:09:13.130 --> 01:09:14.090]  No.
[01:09:15.030 --> 01:09:16.510]  Depends on what you want to do.
[01:09:16.510 --> 01:09:19.590]  I have a Win32, so DirectX.
[01:09:19.630 --> 01:09:23.330]  There's a Win32 machine learning implementation.
[01:09:23.330 --> 01:09:29.430]  Actually, it's just an interface for TensorFlow ONNX models.
[01:09:31.290 --> 01:09:33.070]  Yeah, I prefer Keras.
[01:09:33.070 --> 01:09:34.050]  It's simple.
[01:09:34.170 --> 01:09:41.310]  If all the math is the same, then you're really just looking at what you prefer.
[01:09:44.350 --> 01:09:45.990]  Keras is good
[01:09:45.990 --> 01:09:49.290]  to get off the ground, but it may limit your customization.
[01:09:51.530 --> 01:09:54.090]  And I think that's exactly what an API...
[01:09:54.090 --> 01:09:58.730]  It doesn't limit you, it abstracts complexity.
[01:09:58.990 --> 01:10:03.830]  So I think the reason Keras is so easy to use is because it does a really good job of abstracting complexity.
[01:10:03.830 --> 01:10:08.670]  But if you want that complexity, if you want to dig in, it's still there for you.
[01:10:08.670 --> 01:10:11.750]  You just have to potentially dig a little deeper.
[01:10:11.750 --> 01:10:15.910]  Versus something like PyTorch, which is...
[01:10:17.210 --> 01:10:20.610]  pretty raw, I would say.
[01:10:22.310 --> 01:10:24.150]  That's probably not true.
[01:10:24.650 --> 01:10:28.730]  Some guy in the back is yelling about MATLAB right now.
[01:10:29.510 --> 01:10:32.250]  Oh, MATLAB. That's a whole other
[01:10:32.250 --> 01:10:36.570]  stack of words you'll have a little bit of PTSD about.
[01:10:36.650 --> 01:10:39.450]  Yeah, don't ever say MATLAB.
[01:10:40.410 --> 01:10:41.410]  Rich is saying
[01:10:43.450 --> 01:10:46.730]  Keras is good for 99% of your projects, and then PyTorch
[01:10:46.730 --> 01:10:50.830]  for when you're getting freaky. And I say PyTorch is good for 90% of your projects
[01:10:50.830 --> 01:10:55.690]  and Rust for when you're getting freaky. There's also JAX, all sorts of stuff.
[01:10:55.690 --> 01:10:58.310]  Oh yeah, JAX. Does anybody use JAX? Do they like it?
[01:10:58.310 --> 01:11:00.930]  I know Jason. Is he in here?
[01:11:03.330 --> 01:11:06.130]  Yeah, it's kind of the Wild West, I would say.
[01:11:07.550 --> 01:11:10.050]  There's a bajillion different frameworks.
[01:11:10.050 --> 01:11:12.350]  Everybody uses something different.
[01:11:15.470 --> 01:11:18.250]  I like TensorFlow. It has TensorFlow
[01:11:18.250 --> 01:11:22.030]  Serving. It seems to have a good ecosystem
[01:11:22.030 --> 01:11:26.750]  around it. But you obviously make trade-offs, right?
[01:11:26.750 --> 01:11:28.850]  But if your
[01:11:29.810 --> 01:11:34.330]  verbiage and your fundamentals are good, then I think you could probably use just about any language
[01:11:34.330 --> 01:11:35.890]  with enough practice.
[01:11:37.650 --> 01:11:39.910]  We're going to create our model now.
[01:11:43.850 --> 01:11:45.770]  It's the same as the picture
[01:11:46.670 --> 01:11:50.030]  effectively, but we're just going to create it in code.
[01:11:50.030 --> 01:11:53.230]  One thing we didn't talk about is activation functions.
[01:11:55.090 --> 01:11:57.190]  Actually, we didn't talk about it a lot.
[01:12:00.030 --> 01:12:02.450]  As your data traverses through the weights
[01:12:02.450 --> 01:12:06.590]  into hidden layers and hidden nodes,
[01:12:07.150 --> 01:12:10.710]  the activation functions, their job is to modulate
[01:12:10.710 --> 01:12:17.490]  each node's output based on
[01:12:17.490 --> 01:12:19.770]  its inputs,
[01:12:20.030 --> 01:12:23.730]  such that the output is relative.
[01:12:31.090 --> 01:12:31.790]  Sorry.
[01:12:31.790 --> 01:12:33.730]  I just lost him in the chat.
[01:12:34.630 --> 01:12:38.170]  We're just going to use Sigmoid. I think Sigmoid is super simple.
[01:12:38.170 --> 01:12:39.590]  It's a good place to start.
[01:12:43.030 --> 01:12:46.350]  What are some other favorites? I know ReLU
[01:12:46.350 --> 01:12:49.770]  has replaced sigmoid.
[01:12:50.510 --> 01:12:52.810]  Is there a good reason for this?
[01:12:52.810 --> 01:12:58.250]  Does everybody go, oh, this is better, so we're going to use it all the time now?
[01:12:59.510 --> 01:13:01.070]  The ReLU paper
[01:13:04.370 --> 01:13:05.330]  basically said
[01:13:05.330 --> 01:13:08.850]  ReLU gets better accuracy scores than sigmoid.
[01:13:08.850 --> 01:13:13.030]  They just compared that across a bunch of things. For images, ReLU
[01:13:13.030 --> 01:13:17.150]  tends to have a higher accuracy score
[01:13:18.950 --> 01:13:20.630]  than Sigmoid does.
[01:13:20.630 --> 01:13:24.610]  And that's basically why it won, because on images you can easily
[01:13:24.610 --> 01:13:28.750]  show that it's better. I don't know whether it's concretely better
[01:13:28.750 --> 01:13:31.390]  in all situations because of that.
[01:13:32.130 --> 01:13:34.730]  You're showing your age with Sigmoid here.
[01:13:36.350 --> 01:13:39.250]  ReLU also handles the disappearing
[01:13:40.990 --> 01:13:42.110]  differential.
[01:13:44.050 --> 01:13:48.690]  Vanishing gradient. Where if you go too far into
[01:13:48.690 --> 01:13:52.570]  the negatives or too far into the positives, you're never going to crawl your way
[01:13:52.570 --> 01:13:53.630]  back out.
[01:13:55.990 --> 01:14:00.190]  And then there's Leaky ReLU that tries to help that even more.
[01:14:00.710 --> 01:14:04.670]  And there's ELU, which tries to say, hey guys, you could be
[01:14:04.670 --> 01:14:07.810]  ReLU, you could be Leaky ReLU, let's just combine the two.
[01:14:07.810 --> 01:14:12.250]  I think it never goes below negative one as an output.
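The activation functions named in this discussion can be written out in a few lines. This is a plain-Python sketch using the common textbook definitions; the leaky slope of 0.01 and the ELU alpha of 1.0 are conventional defaults assumed here, not values taken from the workshop.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Small non-zero slope for negative inputs instead of a flat zero.
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    # Smooth for negative inputs; approaches -alpha but never goes below it.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Far from zero the sigmoid gradient almost vanishes, while ReLU keeps a
# gradient of 1 for any positive input.
print(sigmoid_grad(10.0), relu_grad(10.0))
print(relu(-3.0), leaky_relu(-3.0), elu(-3.0))
```

The first print line is the vanishing-gradient point in numbers: the sigmoid gradient at 10 is tiny, so weight updates that pass through a saturated sigmoid barely move.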
[01:14:13.870 --> 01:14:15.470]  I think this is
[01:14:16.350 --> 01:14:19.470]  actually, I like this discussion. I don't know how many
[01:14:19.470 --> 01:14:23.250]  of you listening are machine learning people to begin with.
[01:14:25.130 --> 01:14:27.610]  And this is why the tutorials are kind of
[01:14:27.610 --> 01:14:30.550]  they're nice to get you off the ground, but I think they're limiting
[01:14:31.050 --> 01:14:34.150]  in the fact that they're always using kind of the same
[01:14:35.710 --> 01:14:39.510]  architectures. So they'll say, oh, this is
[01:14:39.510 --> 01:14:43.430]  good for this and this is good for this, and that's true. And you might end up
[01:14:43.430 --> 01:14:47.810]  there anyway, like they did. Maybe there are just lessons they've learned that you haven't.
[01:14:48.010 --> 01:14:50.390]  But I think it is important to explore
[01:14:52.070 --> 01:14:54.830]  different architectures, different losses, just play with
[01:14:56.350 --> 01:14:59.270]  numbers. I think machine learning is ultimately
[01:14:59.270 --> 01:15:02.550]  about iteration and experimentation
[01:15:03.150 --> 01:15:07.930]  at scale. And that includes
[01:15:07.930 --> 01:15:11.050]  activation functions, that includes output nodes, that includes
[01:15:11.050 --> 01:15:15.070]  any lever you can pull. You should pull it
[01:15:15.070 --> 01:15:16.810]  and see what happens.
[01:15:18.210 --> 01:15:23.190]  Also, as a data scientist, you
[01:15:23.190 --> 01:15:27.130]  should pull the levers and see what happens. Because you might
[01:15:27.130 --> 01:15:31.170]  find that on security data, sigmoid works better than ReLU in this
[01:15:31.170 --> 01:15:35.050]  case. And if you never tried sigmoid because you just go with standard
[01:15:35.050 --> 01:15:37.970]  ReLU, you'd never learn that.
[01:15:40.970 --> 01:15:42.230]  Read everything,
[01:15:42.230 --> 01:15:45.030]  try everything. Hang out
[01:15:45.670 --> 01:15:50.670]  at AI Village. Super smart people.
[01:15:50.810 --> 01:15:53.230]  Just hang out at journal club and
[01:15:54.710 --> 01:15:57.490]  be a little fly on the wall. That's what I do.
[01:15:58.250 --> 01:16:02.030]  Okay, so we've got our model now. One thing we need to do is we need
[01:16:02.030 --> 01:16:06.310]  to put our documents,
[01:16:06.310 --> 01:16:10.030]  our vectors,
[01:16:10.030 --> 01:16:14.470]  into our model. And when I was first starting
[01:16:14.470 --> 01:16:18.430]  this was actually one of the harder pieces that I had to fit in my
[01:16:18.430 --> 01:16:22.010]  brain. I think that's true. It's just the shape
[01:16:23.610 --> 01:16:26.270]  of arrays and how they get introduced into
[01:16:26.270 --> 01:16:30.470]  models. Does anyone in the chat want
[01:16:30.470 --> 01:16:33.410]  to take a stab at
[01:16:34.130 --> 01:16:36.910]  what these three question marks should be?
[01:16:56.060 --> 01:16:58.180]  You can try it.
[01:16:59.360 --> 01:17:01.740]  I'm hung up on the missing parentheses.
[01:17:04.180 --> 01:17:05.300]  Hey, you're right.
[01:17:09.030 --> 01:17:12.530]  Yeah, but Rich has got it right. The feature size.
[01:17:12.970 --> 01:17:17.430]  Yeah, exactly. Let's see, what is that? That would be...
[01:17:30.560 --> 01:17:32.060]  I don't know.
[01:18:02.140 --> 01:18:05.140]  Pretty sure it's your matrix.shape
[01:18:06.420 --> 01:18:07.620]  variable.
[01:18:18.060 --> 01:18:20.220]  Text matrix shape.
[01:18:23.250 --> 01:18:25.730]  Man, that is one thing.
[01:18:25.730 --> 01:18:30.530]  What are the things that you guys Google constantly?
[01:18:38.700 --> 01:18:40.060]  Sorry, you guys are
[01:18:41.040 --> 01:18:43.940]  probably missing some imports.
[01:18:45.020 --> 01:18:48.160]  I'll push a new version. Alright, so if you're missing those imports
[01:18:48.160 --> 01:19:20.940]  you're going to want to add them. Nice.
[01:19:20.980 --> 01:19:25.320]  We created our model. It's super simple.
[01:19:26.540 --> 01:19:30.100]  The text matrix, the input dim that we're
[01:19:30.100 --> 01:19:32.240]  doing is an array of
[01:19:34.500 --> 01:19:37.220]  let's see, probably like 80
[01:19:39.940 --> 01:19:41.100]  86,000
[01:19:41.100 --> 01:19:44.480]  tokens. So across this you're going to see
[01:19:48.050 --> 01:19:51.590]  that's going to look obviously like this.
[01:19:51.590 --> 01:19:54.850]  So we just have, I think, 1,200
[01:19:55.450 --> 01:19:59.230]  samples times 80,000 tokens
[01:19:59.230 --> 01:20:05.440]  long. And we're just going to feed it
[01:20:05.440 --> 01:20:09.000]  in. Nice.
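A toy illustration of the shape question: the matrix is (number of samples) by (number of tokens), and the model's input dimension is the second number, the feature size. The binary bag-of-words vectorizer and the names `text_matrix` and `input_dim` below are assumptions for this sketch; the workshop builds its matrix with a tokenizer and reads the value from the matrix's shape.

```python
docs = [
    "iex new-object net.webclient",
    "get-childitem users",
    "iex downloadstring",
]

# Vocabulary: every distinct token across all documents, in a fixed order.
vocab = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}

def vectorize(doc):
    # One slot per vocabulary token; 1 if the token occurs in the document.
    row = [0] * len(vocab)
    for tok in doc.split():
        row[index[tok]] = 1
    return row

text_matrix = [vectorize(d) for d in docs]

n_samples = len(text_matrix)       # rows: one per document
n_features = len(text_matrix[0])   # columns: one per token
input_dim = n_features             # what the first layer needs to know

print((n_samples, n_features))  # (3, 6)
```

At the scale mentioned in the talk, the same tuple would read roughly (1200, 80000), and the second entry is the answer to the three question marks.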
[01:20:09.000 --> 01:20:12.720]  And then we're going to do a train/test split.
[01:20:15.960 --> 01:20:17.080]  And when you
[01:20:17.080 --> 01:20:21.080]  build a model, you obviously have training data and you want to keep some test data out.
[01:20:21.120 --> 01:20:25.120]  But the idea is eventually that your model will be deployed into the
[01:20:25.120 --> 01:20:29.060]  real world where it's not going to be seeing the data it was trained
[01:20:29.060 --> 01:20:33.080]  on, necessarily. So ideally you want to keep some data
[01:20:33.080 --> 01:20:36.260]  away or out of your training set.
[01:20:36.260 --> 01:20:38.980]  So when it sees real data
[01:20:39.820 --> 01:20:42.940]  or when it sees data it hasn't seen before
[01:20:44.040 --> 01:20:48.620]  it can make hopefully an accurate guess.
[01:20:49.800 --> 01:20:51.440]  So splitting them out
[01:20:54.520 --> 01:20:56.420]  for that reason.
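A hold-out split can be sketched in a few lines of standard-library Python. The function name, the 20% test fraction, and the fixed seed are assumptions for illustration; in practice a library helper such as scikit-learn's `train_test_split` is the usual choice.

```python
import random

def split_train_test(samples, labels, test_fraction=0.2, seed=0):
    # Shuffle indices rather than the data so samples and labels stay aligned.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

samples = [f"doc{i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = split_train_test(samples, labels)
print(len(x_train), len(x_test))  # 8 2
```

The model only ever trains on `x_train`; `x_test` stands in for data it has not seen before.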
[01:20:56.620 --> 01:21:00.000]  I've heard you guys talk about test data
[01:21:02.080 --> 01:21:02.800]  leaking
[01:21:04.860 --> 01:21:08.060]  into your training data during training.
[01:21:08.060 --> 01:21:09.720]  How does that happen?
[01:21:16.640 --> 01:21:19.060]  One of the ways it could happen is
[01:21:19.060 --> 01:21:23.660]  you could have duplicate samples.
[01:21:23.660 --> 01:21:27.320]  Normally you just say, pick a random
[01:21:27.320 --> 01:21:30.460]  set of indices or pick the last 20% of my indices
[01:21:31.800 --> 01:21:35.300]  and that's my test set. Or you do a cross-validation
[01:21:35.300 --> 01:21:38.700]  where you divide it up somehow.
[01:21:38.900 --> 01:21:43.380]  If you had duplicate data, say the last 20% of your data was actually duplicated
[01:21:43.380 --> 01:21:46.880]  with the first 20% of your data and you trained on the first
[01:21:47.480 --> 01:21:50.600]  four fifths and then you tested on the last one fifth
[01:21:51.240 --> 01:21:55.700]  well since it's duplicated it's included in your training set and you do really well.
[01:21:56.100 --> 01:21:59.240]  So you have to make sure that you don't have that sort of issue. And then there's other little
[01:21:59.240 --> 01:22:00.720]  things that you could have.
[01:22:02.960 --> 01:22:05.020]  It's basically a data cleaning issue
[01:22:06.560 --> 01:22:08.860]  a lot of the times. Sometimes it's a weird bug
[01:22:09.520 --> 01:22:12.700]  in your code. Sometimes you can just
[01:22:12.700 --> 01:22:17.160]  reverse the model itself and you'll say, oh, show me
[01:22:17.160 --> 01:22:20.620]  what this looks like and then it'll just print out an example
[01:22:20.620 --> 01:22:22.940]  from the training set. Nice.
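The duplicate-sample leak described above can be shown, and guarded against, with a small check. This sketch assumes exact duplicates and deduplicates by SHA-256 hash before splitting; the helper names are hypothetical, and near-duplicates would need fuzzier matching than this.

```python
import hashlib

def dedupe(samples, labels):
    # Keep the first occurrence of each distinct sample, matched by content hash.
    seen = set()
    kept_samples, kept_labels = [], []
    for sample, label in zip(samples, labels):
        digest = hashlib.sha256(sample.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept_samples.append(sample)
            kept_labels.append(label)
    return kept_samples, kept_labels

def leaked(train, test):
    # Samples that appear on both sides of the split.
    return set(train) & set(test)

samples = ["iex a", "get-date", "iex b", "write-host hi", "iex a"]
labels = [1, 0, 1, 0, 1]

# Naive split: train on the first four fifths, test on the last fifth.
naive_leak = leaked(samples[:4], samples[4:])

# Deduplicate first, then split the same way.
clean_samples, clean_labels = dedupe(samples, labels)
clean_leak = leaked(clean_samples[:3], clean_samples[3:])

print(naive_leak, clean_leak)
```

In the naive split the test sample was already seen during training, so a perfect score on it says nothing about unseen data.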
[01:22:23.760 --> 01:22:25.540]  So we have early stopping
[01:22:26.380 --> 01:22:28.000]  twice. Early stopping
[01:22:29.940 --> 01:22:33.660]  is super nice. So if you have a really long running task and your
[01:22:34.600 --> 01:22:38.060]  model is not improving, early stopping will just stop your model.
[01:22:39.280 --> 01:22:41.580]  Part of me doesn't like it. I just feel like
[01:22:41.580 --> 01:22:45.840]  it could get better. But this is a tiny
[01:22:45.840 --> 01:22:48.460]  model so we don't really mind.
[01:22:50.200 --> 01:22:53.740]  Callbacks. So if you want to use TensorBoard, whatever.
[01:22:53.740 --> 01:22:57.660]  Batch size. This would be the frequency
[01:22:57.660 --> 01:23:02.580]  at which updates to the weights will happen.
[01:23:03.040 --> 01:23:04.700]  Epochs. This is just the time
[01:23:05.580 --> 01:23:09.600]  that we're going to do. Sorry, the number of times that we're going to
[01:23:09.600 --> 01:23:11.120]  run through the data.
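How batch size, epochs, and early stopping interact can be sketched without any framework. This is a simplified stand-in for what a callback such as Keras's `EarlyStopping` does, not its actual implementation; the loss values, the patience of 2, and the batch size of 32 are made up for illustration.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # One weight update per batch, so smaller batches mean more updates per epoch.
    return math.ceil(n_samples / batch_size)

def epochs_run(val_losses, patience=2):
    # Stop once the loss has failed to improve for `patience` epochs in a row.
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)

# Hypothetical per-epoch validation losses for a run scheduled for 6 epochs.
val_losses = [0.90, 0.60, 0.50, 0.52, 0.51, 0.40]

stopped_at = epochs_run(val_losses, patience=2)
print(stopped_at)                   # 5
print(updates_per_epoch(1200, 32))  # 38
```

The made-up run also shows the worry voiced above: stopping at epoch 5 misses the improvement that would have arrived at epoch 6.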
[01:23:18.000 --> 01:23:21.780]  We can train. It obviously
[01:23:21.780 --> 01:23:24.320]  seems pretty accurate. I don't know,
[01:23:25.480 --> 01:23:28.340]  I haven't actually tried to optimize this
[01:23:28.340 --> 01:23:31.440]  at all. But what do you guys do when
[01:23:32.420 --> 01:23:36.560]  a model trains really accurately at first? Do you think it's beneficial
[01:23:36.560 --> 01:23:40.420]  that you would try and overfit a model
[01:23:40.420 --> 01:23:44.280]  at first? Because that would at least indicate it
[01:23:44.280 --> 01:23:45.740]  could learn something.
[01:23:48.300 --> 01:23:51.880]  Rich and I both get very paranoid when the thing does too well.
[01:23:54.280 --> 01:23:56.220]  You've got an accuracy of 100%
[01:23:56.220 --> 01:24:00.500]  but something's wrong. Something must be wrong.
[01:24:00.680 --> 01:24:02.280]  Yes, something is wrong.
[01:24:02.820 --> 01:24:07.540]  There could be any number of things. This one says version 1
[01:24:07.540 --> 01:24:11.960]  and I'm giving you
[01:24:11.960 --> 01:24:16.280]  the code so you can go and recreate the model.
[01:24:16.660 --> 01:24:19.980]  Do whatever you need to and I would love to see
[01:24:22.200 --> 01:24:23.520]  I'd love to see
[01:24:23.520 --> 01:24:28.420]  some blog post or some write up about everything I missed.
[01:24:29.980 --> 01:24:32.640]  So this is a visualization of the training loss.
[01:24:32.760 --> 01:24:36.060]  Obviously with gradient descent you would like to see
[01:24:36.060 --> 01:24:38.460]  this going down. If this were inverted
[01:24:40.300 --> 01:24:44.100]  that would indicate that, well it could indicate a number of things
[01:24:44.100 --> 01:24:48.720]  but either it's not learning or it's overfitting.
[01:24:48.720 --> 01:24:51.820]  And it's stopped learning completely.
[01:24:52.640 --> 01:24:55.280]  There are some evaluation metrics.
[01:24:56.240 --> 01:25:00.060]  Obviously it's going to be like, yeah I did really well on this.
[01:25:01.280 --> 01:25:04.680]  As we saw up here. So that's a little suspect.
[01:25:04.900 --> 01:25:08.760]  But even after you have a model, I like
[01:25:08.760 --> 01:25:12.840]  to pull out a couple of the best
[01:25:12.840 --> 01:25:16.980]  and worst examples of a category.
[01:25:16.980 --> 01:25:19.140]  So we're just taking the first
[01:25:22.000 --> 01:25:22.880]  malicious document
[01:25:25.840 --> 01:25:26.720]  because we
[01:25:29.980 --> 01:25:32.840]  put them in a list such that they were malicious
[01:25:32.840 --> 01:25:36.120]  and then non-malicious. So we're taking the first, we're taking the second
[01:25:39.280 --> 01:25:41.000]  malicious document and then we're going to take the
[01:25:41.000 --> 01:25:45.420]  last non-malicious and we're just going to try and predict and see what they are.
[01:25:46.120 --> 01:25:48.780]  And yeah, they're
[01:25:50.680 --> 01:25:53.500]  a little too accurate for my liking.
[01:25:53.820 --> 01:25:57.800]  But yeah, that's kind of it.
[01:25:57.860 --> 01:26:01.260]  Okay, so this is kind of where we leave you. So the first
[01:26:01.260 --> 01:26:05.280]  Excel file that you're looking at here, when I did it on
[01:26:05.280 --> 01:26:09.800]  all 380,000 scripts, it took me
[01:26:09.800 --> 01:26:13.680]  well, and this would depend on whatever kind of potato you're running
[01:26:13.680 --> 01:26:17.320]  but it took me 17 hours to do.
[01:26:17.840 --> 01:26:21.780]  So we're probably not going to run it here. You might already be running it,
[01:26:21.780 --> 01:26:25.500]  or have had to reset a few times because you saw your thing freeze.
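The long-running step is a token-toggling loop: flip a token, compare the new prediction with the base prediction, and keep a running sum per token. The sketch below is a simplified guess at that idea, flipping one token at a time rather than every combination. The `predict` function is a hypothetical stand-in for the trained copycat model, and the vocabulary and weights are invented.

```python
import math

# Hypothetical stand-in for the trained model: a fixed linear scorer.
VOCAB = ["iex", "downloadstring", "get-childitem", "write-host"]
WEIGHTS = [2.0, 1.5, -1.0, -0.5]

def predict(row):
    z = sum(w * x for w, x in zip(WEIGHTS, row))
    return 1.0 / (1.0 + math.exp(-z))

def token_scores(matrix):
    scores = [0.0] * len(VOCAB)
    for row in matrix:
        base = predict(row)                 # base prediction for this sample
        for i in range(len(row)):
            toggled = list(row)
            toggled[i] = 1 - toggled[i]     # flip this token on or off
            new = predict(toggled)          # new prediction
            # Sign convention: positive means the token's presence raises the score.
            delta = (new - base) if toggled[i] == 1 else (base - new)
            scores[i] += delta              # cumulative sum across samples
    return scores

matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]

scores = token_scores(matrix)
ranked = sorted(zip(VOCAB, scores), key=lambda pair: pair[1], reverse=True)
for token, score in ranked:
    print(f"{token:15s} {score:+.3f}")
```

The cost is one extra prediction per token per sample, which is why the full run over hundreds of thousands of scripts takes hours.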
[01:26:26.080 --> 01:26:29.600]  But effectively what we're doing is we're just toggling absolutely every
[01:26:31.860 --> 01:26:33.500]  possible combination and then
[01:26:33.500 --> 01:26:36.520]  making a base prediction and then a new prediction
[01:26:37.700 --> 01:26:41.660]  and then we're keeping a cumulative sum of those. And what you
[01:26:41.660 --> 01:26:45.360]  end up getting is a spread of scores
[01:26:45.360 --> 01:26:49.540]  across a number of predictions
[01:26:50.560 --> 01:26:53.120]  that will sort your tokens into
[01:26:53.570 --> 01:26:57.580]  malicious and non-malicious. This is the same
[01:26:57.580 --> 01:27:01.760]  code from the Proofpoint research. The Proofpoint research was
[01:27:01.760 --> 01:27:05.120]  easier in this regard because they had a wider
[01:27:06.780 --> 01:27:09.600]  range of 1
[01:27:09.600 --> 01:27:13.720]  to 999, where these are hard labels and it's a 0
[01:27:13.720 --> 01:27:17.520]  or a 1. As a first run through
[01:27:17.520 --> 01:27:21.140]  looking at the most malicious and the least malicious tokens
[01:27:22.240 --> 01:27:25.520]  without any optimization, if it were ever
[01:27:25.520 --> 01:27:29.860]  going to work, I would expect to see
[01:27:29.860 --> 01:27:32.300]  that at least they're being sorted
[01:27:33.160 --> 01:27:37.480]  to some degree. Because now you can go back and you can start tweaking
[01:27:37.480 --> 01:27:40.000]  the model or whatever it might be
[01:27:41.360 --> 01:27:45.480]  to really pull out, really hone in on what's
[01:27:45.480 --> 01:27:48.200]  accurate. I think a lot of times for attackers, this
[01:27:49.180 --> 01:27:53.380]  first version would be just fine. But
[01:27:53.380 --> 01:27:56.000]  as you go into the future
[01:27:57.400 --> 01:28:00.980]  you could even do this. You could always be collecting data
[01:28:02.260 --> 01:28:04.020]  and actually I remember
[01:28:04.020 --> 01:28:08.220]  after my first BSides talk about machine learning
[01:28:08.220 --> 01:28:12.080]  last year, Rob came up to me and was like, hey, you should
[01:28:12.080 --> 01:28:16.320]  have a separate data gathering campaign
[01:28:17.440 --> 01:28:20.060]  so you keep your ops and your
[01:28:20.060 --> 01:28:24.060]  data collection separate, but your data collection can support
[01:28:24.060 --> 01:28:27.100]  your ops and it doesn't necessarily
[01:28:27.940 --> 01:28:32.200]  burn you because collecting data can be
[01:28:32.200 --> 01:28:36.280]  noisy. So it's a bit over time, but do you guys have
[01:28:36.280 --> 01:28:40.580]  any questions? Comments?
[01:28:40.580 --> 01:28:44.360]  What I would love to see is someone to take
[01:28:44.360 --> 01:28:48.640]  this Defender model and absolutely crush it.
[01:28:50.560 --> 01:28:52.320]  I didn't... I mean
[01:28:52.320 --> 01:28:56.280]  it's a tiny model, I'm not sure Microsoft would care that much. There's a VBA
[01:28:57.000 --> 01:29:00.320]  dataset I have not touched. So if you
[01:29:00.320 --> 01:29:04.260]  guys want to race to whatever, I'd love
[01:29:04.260 --> 01:29:06.760]  to see what comes out of it.
[01:29:07.680 --> 01:29:11.940]  I appreciate everybody. Thanks for coming.
[01:29:12.740 --> 01:29:16.080]  Hit me up in the Slack or not Slack, Discord
[01:29:16.080 --> 01:29:20.260]  if you have any questions or you just want to rag on my terrible code
[01:29:20.260 --> 01:29:23.000]  that's fine too, but
[01:29:25.480 --> 01:29:27.980]  I will end it there.
[01:29:29.980 --> 01:29:32.100]  So I think this is the
[01:29:32.100 --> 01:29:36.100]  last stream of the night, so we're closing out the
[01:29:36.100 --> 01:29:40.040]  Twitch stream now. Tomorrow hopefully will go smoother.
[01:29:40.040 --> 01:29:43.080]  We learned some things today, but I'll
[01:29:43.880 --> 01:29:48.120]  see you guys all in the morning. Hopefully.
[01:29:48.320 --> 01:29:49.520]  Hopefully.
